Streaming your data

So, you want to process data that flows and you want to show information based on this data in real-time, right?

What options are available to you?

Background

First, let’s cover the components that you will need:

Figure 1. Flow of Data

Data Streaming & Processing

As seen in Figure 1, Data Streaming & Processing is only one part of the architecture.

In this post I will focus only on this part, but I will suggest tools that may fill in for the rest of the components:

Data Source: This is any source of data you may have. Well-known tools for sending the data into the processing component are Apache Kafka and Amazon Kinesis.

Analytical Database: Any database that fits your needs for aggregating the information for the UI. Some suggestions are: Cassandra, HBase, Elasticsearch, Couchbase, etc.

Front End: This will generally be a web server that serves a web application for your use.

Data Streaming and Processing engine

Now we are at the heart of this post.

What Data Streaming and Processing engines are available, and what are the differences between them?

Requirements:

These are the expected abilities of the stream processor:

  1. Distributed system with the ability to scale linearly.
  2. Fault Tolerance and Durability (Consistency).
  3. Fast!
  4. Resource Management and the ability to rebalance resources.
  5. Mature enough.

Note: For items #1 and #2, according to the CAP theorem, it may be difficult to achieve both of them. Therefore, a system that allows choosing between the two based on the expected SLA is a good catch!

Two approaches

Today we can find two different approaches that can be adopted:

  1. Real-time streaming infrastructures. Data streams between processing units over the network (Apache Storm, Apache Flink).
  2. Micro-batch infrastructures. Data is persisted between the hops (Spark Streaming).

There is also a third approach that combines features from both: Apache Samza streams data in real time, but keeps the intermediate results between the hops persisted.

In this post, I will cover the following tools:

Apache Storm

Storm was the first real-time event processing framework that was released and that addressed most items in our requirements list. It was developed by Nathan Marz and the BackType team, and was then acquired by Twitter and open sourced.

It is also the most adopted system as can be seen here.

Let’s check if Storm addresses all our needs:

Distributed System: Yes. Storm is certainly a distributed system that can scale. I will not go into the architecture of Storm here. I will only mention that Storm is built of small runnable components called Spouts and Bolts. Tuples are passed between the Spouts and Bolts via ZeroMQ/Netty. An application can grow in scale by adding more bolts or by increasing parallelism. For more information, please refer to the Storm home page.
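To make this concrete, here is a minimal topology sketch against the older backtype.storm API (Storm 0.9.x). The bolt and the wiring are invented here purely for illustration; the parallelism hints passed to setSpout and setBolt are the scaling knobs mentioned above.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SimpleTopology {

    // A trivial bolt, invented for this sketch: upper-cases each word and emits it onward.
    public static class UpperCaseBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0).toUpperCase()));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout with 2 executors, bolt with 4 -- the parallelism hint is the scaling knob.
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("upper", new UpperCaseBolt(), 4).shuffleGrouping("words");
        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}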

Fault Tolerance vs. Fast Streaming: Here, a lot of work was done to let you choose between a more durable application and a faster one, where durability comes with a latency penalty.

So on one side, Storm provided a very good configurable way to address your SLA.

On the other hand, when working with Storm you may reach points where the system is not durable enough, and making it more durable will come at a latency cost.

In general, it needs a lot of tuning to find the best configuration for your application.
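To give a feel for that tuning, here is a sketch of the kind of knobs involved (backtype.storm API; the values are purely illustrative and would have to be found by measurement against your own SLA):

import backtype.storm.Config;

public class StormTuningSketch {

    // Favor raw speed: with no acker executors, tuples are not tracked and not replayed.
    public static Config fastConfig() {
        Config conf = new Config();
        conf.setNumAckers(0);
        return conf;
    }

    // Favor durability: track tuples, bound how many are in flight, and replay on timeout.
    public static Config durableConfig() {
        Config conf = new Config();
        conf.setNumAckers(2);           // dedicated acker executors track tuple trees
        conf.setMaxSpoutPending(1000);  // back-pressure: cap un-acked tuples per spout task
        conf.setMessageTimeoutSecs(30); // replay tuples that are not acked within 30 seconds
        return conf;
    }
}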

Resource Management: Here again, Storm does not make it easy for you. As real-time data may have daily and seasonal peaks, you may need to adjust your resources to be sure you do not lag behind.

There is, however, a good initiative of Mesos for Storm that makes your life much easier in this regard!

Spark Streaming

Spark was initiated by Matei Zaharia at UC Berkeley. Originally this project was meant to address some issues in Hadoop, like latency, and to improve machine learning capabilities. Later on it was taken up by Databricks and became an Apache project.

As the Spark idea grew with Hadoop in mind, batch processing was its backbone. The core of Spark is the Resilient Distributed Dataset (RDD), which is a logical collection of data partitioned across machines.

Spark streaming was added after the project started to grow significantly.

This went very well with the Lambda Architecture, a concept that was talked about a lot but not implemented very much. With Spark, you can gain both offline batch processing and real-time streaming in one platform.

So with Spark, which is based on batch processing, streaming is in fact still batch processing and not real-time streaming. The idea is that if you shorten your batches enough, until they become micro-batches, you get a latency close to what real-time streaming offers.
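A minimal sketch of what that looks like in Java (the socket source and the 2-second batch interval are placeholder choices, not a recommendation): every batch interval, whatever has been received so far becomes an RDD and is processed as a small batch.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("micro-batch");

        // The batch interval is the "micro" in micro-batch: every 2 seconds the
        // received data is turned into an RDD and handed to the batch engine.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // Read lines from a local socket (e.g. `nc -lk 9999`), just for illustration.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<Long> counts = lines.count();  // one count per micro-batch
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}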

While latency may be a problem, the durability of batch processing is superior to that of real-time processing. Fault tolerance is easier to maintain and resource utilization is more efficient.

Apache Samza

Samza was developed at LinkedIn and was aimed at addressing the pitfalls of Storm.

There is some similarity between Samza and Storm, but also some important differences.

In general, the architecture is similar: A distributed system that can scale horizontally; Streams and Tasks replace Spouts and Bolts.

Fault Tolerance vs. Fast Streaming: In this aspect, Samza took the consistency approach over fast streaming. The messages that are sent are always persisted to disk. Therefore, the issue of bottlenecks that may occur due to temporarily high throughput is solved. But the end-to-end latency may be a bit longer, and it depends on the number of hops along the message's path.
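For a feel of the programming model, here is a minimal Samza task sketch using the low-level StreamTask API. The "kafka" system name and the output topic are assumptions; in a real job they come from the job's configuration file, and because both ends are Kafka topics, every hop is persisted to the broker's log.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Consumes one stream and produces another; the intermediate data lives in Kafka.
public class UpperCaseTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "words-upper");

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String word = (String) envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, word.toUpperCase()));
    }
}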

A good comparison between Storm and Samza can be found here.

Samza comes with YARN integration, so resource management should be well addressed.

Choosing a streaming framework

When deciding which streaming framework to choose for a new project, there are some guidelines.

First: define your expected SLA (Service Level Agreement).

A distributed system is not easy to manage and monitor, and there are trade-offs that you need to decide on.

Low Latency

If you need a very low latency system that can get you responses in milliseconds or a few seconds, then Storm is preferred. Another point in Storm's favor is that it lets you manage the data flow in the system as a visual topology map.

High Throughput

Choose Spark Streaming if you can compromise on latency, down to a few seconds or even minutes, but want to know that you can sustain high throughput and unexpected spikes. Another benefit is that you can use the same infrastructure for both batch processing and streaming.

Conclusion

We saw that there are several good cluster-based, open-source streaming frameworks available today to choose from.

Each one took a different technology path but they all address the same goal: providing an infrastructure that is fault tolerant and can run on clusters, on which you can deploy small processing units that process streaming data.

Storm is quite mature and has proven very good performance results.

On the other hand, Spark Streaming has an emerging community and lets you develop durable data frameworks.

Samza took a kind of hybrid approach between the two and has not gained a lot of popularity yet.

Go Streaming!

 

How to use Avro with Kafka and Storm

Background

In this post I’ll provide a practical example of how to integrate Avro with data flowing from Kafka to Storm:

  • Kafka is a highly available, high-throughput messaging system from LinkedIn.
  • Storm is a distributed event stream processing system from Twitter.
  • Avro is a data serialization protocol that is based on a schema.

All the above are open source projects.

The challenge

There are two main challenges when trying to make Avro data flow between Kafka and Storm:

  1. How to serialize/deserialize Avro data on Kafka (see the sketch after this list)
  2. How to manage the changes in protocol that will come when the schema evolves.
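For the first challenge, the essence is turning a GenericRecord into a byte array that can travel as the Kafka message payload, and decoding it back on the Storm side. The following is only a sketch of that idea, not the code from the repository:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroBytes {

    // Encode a record into the byte[] that becomes the Kafka message payload.
    public static byte[] serialize(GenericRecord record, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Decode the payload back into a record on the consuming (Storm) side.
    public static GenericRecord deserialize(byte[] bytes, Schema schema) throws IOException {
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}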

Code example

Here I provide a simple code sample that demonstrates the use of Avro with Kafka and Storm.

The code can be downloaded from GitHub here: https://github.com/ransilberman/avro-kafka-storm.git

In order to download it, just type:

$ git clone https://github.com/ransilberman/avro-kafka-storm.git

Building from Source

$ mvn install -Dmaven.test.skip=true

Usage:

The project contains two modules: the first is AvroMain, which is the Kafka producer test application; the second is AvroStorm, which is the Storm topology code.

Basic Usage:

LPEvent.avsc is the Avro Schema. It exists in both projects.

MainTest runs three tests, each of which sends an event to Kafka. You may need to change the variable ‘zkConnection’ to your own list of ZooKeeper servers that holds the Kafka brokers.

AvroTopology is the class that runs the Storm topology. You may need to change the ZooKeeper servers list that holds the Kafka brokers.

First start the topology, then run the tests, and the events will pass via Kafka in Avro format.

First test: sends an event with the GenericRecord class.
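For reference, building an event with GenericRecord looks roughly like this; the inline schema and field names are made up for illustration and are not the actual LPEvent schema:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericRecordExample {
    public static void main(String[] args) {
        // An inline schema just for illustration; the repository uses LPEvent.avsc instead.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

        // GenericRecord needs no generated classes -- fields are set by name.
        GenericRecord event = new GenericData.Record(schema);
        event.put("id", "event-1");
        event.put("timestamp", System.currentTimeMillis());

        System.out.println(event);
    }
}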

Second test: builds an Avro object using generated resources. This test will compile only after the code is generated using the Maven plugin avro-maven-plugin.

Third test: serializes the Avro objects into a file and then reads the file back.
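The file round-trip in the third test boils down to Avro's data-file API, roughly as in this sketch (not the repository code):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFileExample {
    public static void roundTrip(GenericRecord record, Schema schema, File file) throws IOException {
        // Write: an Avro data file embeds the schema, so any reader can decode it later.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(record);
        writer.close();

        // Read it back; the schema is taken from the file itself.
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
            System.out.println(reader.next());
        }
        reader.close();
    }
}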

Upgrading the Avro schema version:

AvroTopology2 is a second Storm topology that demonstrates how to use a schema that differs between the producer and the consumer. The difference in the code is in this line:

DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(_schema, _newSchema);

Note that there are two parameters to the GenericDatumReader constructor.
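Put together, reading with both a writer schema and a reader schema looks roughly like this sketch (the names are illustrative); Avro resolves the differences between the two schemas while decoding, for example by filling in defaults for fields added in the new schema:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;

public class EvolvingReaderSketch {

    // writerSchema: the (old) schema the producer used to encode the payload.
    // readerSchema: the (new) schema the consumer wants to see.
    public static GenericRecord decode(byte[] payload, Schema writerSchema, Schema readerSchema)
            throws IOException {
        DatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}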