Hadoop Ecosystem – How to find your way around?

Hadoop and HDFS are the pillars of today’s big data systems. They are the cornerstone of almost every big data project.
But in order to use them properly and efficiently, you need to know their ecosystem.
Today there are dozens of systems that together form a huge ecosystem around Hadoop and HDFS.

The Hadoop Ecosystem Table lists many of the related projects, and the list keeps growing.

Why are there so many projects, and how can we find our way among all of them?

Why are there so many projects?

Let’s start with the first question: why are there so many?

When Hadoop emerged as an open source solution for storing big data tables, many companies quickly adopted it, hoping it would solve the big data issues they faced with their legacy RDBMSs.

The promise was big, but the fulfilment not so much.

On the one hand, Hadoop delivered what it promised: it can store a lot of data, and we can write MapReduce jobs to query that data.

However, we all soon realized that Hadoop has very big drawbacks compared to the way we worked with an RDBMS:

  1. Management and GUI tools are missing.
  2. SQL is not provided out of the box.
  3. Data queries can take a very long time to return results.
  4. Installation, upgrades and housekeeping are cumbersome.
  5. It is difficult to integrate with programs for DB operations.

In order to resolve the issues raised above, many projects started to emerge. But since Hadoop has no single owner, there was no order in the way those projects were created, and in many cases projects competed with one another, each claiming to do the same thing better.

Define your needs

Now to the second question: how can we find our way among all the Hadoop-related projects?

This is not an easy one to answer, but this is the goal of my post today, so let’s start.

Before you go searching for solutions, you first need to ask yourself these questions:

Q1: What are your requirements?

Define your business needs well. Hadoop and HDFS are mainly intended for big data storage that can scale horizontally. Do not choose Hadoop and HDFS if this is not your requirement; there are many other NoSQL systems that may be a better fit if your requirements are different.

Q2: What is your expected SLA?

Hadoop and HDFS provide big data storage on a cluster. This means you will have to face issues of data freshness, eventual consistency and long response times. Define your system’s SLAs for those issues: response time, strong versus eventual consistency, and so on.

Q3: What are you willing to compromise on in your requirements?

Following on from the previous question: check carefully what you are willing to compromise on in your SLA. You will not be able to get everything! There is no way to get a big data cluster that scales without limits and still provides fast, consistent queries. There is no magic here!

Classification of projects and systems

The Hadoop Ecosystem Table mentioned above classifies the different projects by the technical criteria they address.

That does not make your life easier when you want to choose one.

I will try to provide a simpler classification to bring some order. Here the systems are divided into the following categories:

  1. Systems that are based on MapReduce
  2. Systems that replace MapReduce
  3. Complementary Databases
  4. Utilities

Systems that are based on MapReduce

Originally, Hadoop provided two parts: the file system (HDFS) and the processing mechanism (MapReduce). Many projects were built on top of MapReduce to provide a simplified API that eventually runs a MapReduce job.

Some examples are Hive, Pig and Cascading.
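
For contrast, here is a minimal sketch of what these layers hide: the classic word count written directly against the raw MapReduce Java API, roughly as it appears in the Hadoop tutorial (the input and output paths come from the command line).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```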

The benefit of those systems is that you need (almost) nothing besides vanilla Hadoop running on your cluster in order to use them.

The drawback is that they are tied to MapReduce processing, which, in some regards, is pretty slow.
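
To illustrate how much these layers simplify things, the same word count becomes a single SQL statement in Hive. Here is a sketch that runs it through Hive’s standard JDBC driver; the host, table name (docs) and column name (line) are made up for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; host, port and database are illustrative.
    String url = "jdbc:hive2://hive-server:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement()) {
      // One SQL statement instead of a full MapReduce program;
      // Hive compiles it into MapReduce jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT word, COUNT(*) AS cnt "
          + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w "
          + "GROUP BY word");
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```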

Systems that replace MapReduce

The limitations of MapReduce were clear to many people:

  1. It is slow.
  2. It is a general-purpose processing engine that may not be good for specific purposes.
  3. Its API is complicated to program against.

Several projects then started to appear. None of them replaces HDFS as the data layer; rather, they suggest better processing mechanisms. Interestingly, all those projects do more or less the same thing, and they end up competing with each other.

Some examples are Impala, Tez and Spark.

Most of those projects provide better response times than MapReduce. But note that they do not provide results as fast as you would expect when querying an indexed column in a relational database.
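
To make the contrast concrete, here is a minimal sketch of the same word count written against Spark’s Java API (the HDFS paths are illustrative). The data still lives in HDFS; only the processing engine changes.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read from HDFS, but process with Spark's engine instead of MapReduce.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);
    counts.saveAsTextFile("hdfs:///data/output");

    sc.stop();
  }
}
```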

Complementary Databases

HDFS is great for storing very big files, but it is not a database. Which database, then, can we use on top of HDFS, or beside it, to gain the benefits of HDFS as the storage for those big files?

There are many options here.

There are solutions that use HDFS as their file system and add indexing and management on top of it: HBase and Cassandra (which can also run without HDFS) are good examples of that.
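
For example, HBase exposes a key-value API with fast lookups over data that ultimately lives in HDFS. A minimal sketch with the standard HBase Java client; the table ("events"), column family ("d"), qualifier and row key are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {
      // Write a single cell: row key -> column family "d", qualifier "payload".
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
          Bytes.toBytes("hello"));
      table.put(put);

      // Read it back by key: an indexed lookup that plain HDFS cannot offer.
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```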

There are many other NoSQL databases that do not use HDFS at all. The list is long and can be found here and in other places.

A common architecture, the Lambda architecture, suggests not using the data in Hadoop directly for reporting, but rather using another database to hold partial views of the data in Hadoop.
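
The essence of the pattern is that a report query merges a precomputed batch view with a small real-time view, instead of scanning Hadoop directly. A purely illustrative sketch, where the two views are just in-memory maps standing in for a serving database and a speed layer:

```java
import java.util.Map;

public class LambdaQuery {
  // Batch view: precomputed periodically from the full data set in Hadoop.
  private final Map<String, Long> batchView;
  // Real-time view: covers only events that arrived since the last batch run.
  private final Map<String, Long> realtimeView;

  public LambdaQuery(Map<String, Long> batchView, Map<String, Long> realtimeView) {
    this.batchView = batchView;
    this.realtimeView = realtimeView;
  }

  // A report query never touches Hadoop directly: it merges the two views.
  public long count(String key) {
    return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
  }
}
```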

Utilities

There are many projects that build tools or utilities on top of Hadoop. Some examples are Hue, Mesos, Mahout and many others.

In addition, there are large projects that enable managing a Hadoop cluster in a friendly manner, such as Cloudera Manager and Ambari.

Summary

We have seen that there are many, maybe too many, projects and systems related to Hadoop.

The reason for this large number of projects is historical and practical: Hadoop grew from its beginnings in the open source community, resulting in many related projects. Because it was so successful, many people wanted to contribute and started new projects.

Choosing from this big list the systems that will best fit your project can be a bit overwhelming in the beginning.

The first step is to define your requirements and SLAs very clearly. Most important is to decide where you can compromise on the SLA and where you cannot.

After that, look at the different projects and find the ones that best fit your requirements. This may not be a simple task, but a good definition of your requirements will help you avoid taking the wrong approach.