Big Data has become one of today's most frequent buzzwords, and the web is full of articles describing and defining it. Behind all the hype, however, the paradigms behind this buzzword and the technology stacks that enable it continue to grow, both in their own right and in the number of companies that adopt them. When speaking about Big Data and its technologies, we unequivocally have to speak about Hadoop, since it is the cornerstone of all emerging Big Data technologies.
Hadoop is a top-level project of the Apache Software Foundation and has been around for about 10 years now. It actually started as a means to support the distribution of another Apache incubator project, a web crawler named Nutch, and it was inspired by work done at Google, made public in the 2004 paper MapReduce: Simplified Data Processing on Large Clusters. Ten years later, this page stands as testimony to the adoption of Hadoop in today's technology market.
Since this article is written by a sysadmin, its scope is to shed some light on the installation, configuration and management of a Hadoop cluster, and to show the amazing progress made in this area, which makes adopting such a technology much simpler.
First things first: how it works. This article will focus on the 2.x branch of Hadoop, because it has the most active community these days, and it is here that the Big Data ecosystem truly took off.
Hadoop consists of two main components, HDFS and YARN, which stand for Hadoop Distributed File System and Yet Another Resource Negotiator respectively, and, as mentioned before, it uses the MapReduce computation paradigm. This paradigm is the basis of Hadoop, because the system was designed around the concept of data locality: it is faster to send computation (think jars or even bash scripts) to where the data resides than to bring the data (think on the order of terabytes) over the network to where the scripts reside. From this principle, the design decision to have two main subsystems becomes clearer: you need one system to handle the data and one to handle the computation and, as you will see, the resource management as well.
Without further ado, let's talk about the HDFS component. HDFS, as the name implies, is a distributed file system which implements a "relaxed" POSIX standard and is designed to run on commodity hardware. The simplest version of an HDFS cluster consists of two types of daemons, the Namenode and the Datanode, in a master/slave architecture. HDFS exposes files which are stored as 128 MB blocks on the Datanodes. There is usually only one Namenode in the cluster, and its role is to keep the file system metadata. It regulates user access to the files and issues operations such as opening, closing and renaming files. It also keeps the mapping of blocks to Datanodes. The Datanodes hold all the blocks of which the file system consists and are responsible for servicing read and write requests.
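To make that interaction more concrete, here is a minimal sketch (not part of any particular setup) of a client talking to HDFS through Hadoop's Java FileSystem API; the Namenode address and the paths are assumptions chosen purely for illustration:

```java
// Minimal HDFS client sketch: the client asks the Namenode for metadata,
// then streams block data to and from the Datanodes transparently.
// The fs.defaultFS address and the paths below are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed Namenode address

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and places them on Datanodes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back; the Namenode only supplies block locations,
        // the data itself comes straight from the Datanodes.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```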
HDFS was designed with fault tolerance in mind, implemented by replicating blocks between Datanodes, a design choice which also helps in distributing the computation, but more on that later. There are many more features worth discussing, but since this is meant to be a brief introduction, I'll mention just one more, which came with the 2.x branch: HDFS can be set up with high availability, which makes rolling restarts possible and prevents downtime in case of Namenode failure.
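If you want to see the replication at work, something along the lines of the following sketch can inspect block placement through the same FileSystem API; the file path and the replication factor of 3 are assumptions:

```java
// Sketch: inspecting replication and block placement for an existing file.
// Assumes the cluster configuration is on the classpath; the path and
// replication factor are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // picks up cluster config from the classpath
        Path file = new Path("/user/demo/hello.txt");        // assumed path

        // Ask for a higher replication factor; the Namenode schedules the extra copies.
        fs.setReplication(file, (short) 3);

        // List which Datanodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " hosted on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```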
The second component of Hadoop, the one in charge of computation, is YARN (sometimes referred to as MapReduce 2.0), and it is available only in the 2.x branch. YARN splits its functionality among three types of daemons: the ResourceManager, the NodeManager and the ApplicationMaster. The architecture of YARN is also master/slave oriented: the cluster has only one ResourceManager, which services the requests for resources (memory and, starting with 2.6, CPU) for the whole cluster, while the NodeManagers are spread over all the hosts in the cluster. In broader terms, you can think of the ResourceManager as the CPU scheduler of an operating system and the NodeManagers as processes asking for time on the processor, a view also known as "the datacenter as a computer".
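As a rough illustration of that bookkeeping, the sketch below uses YARN's Java client API to ask the ResourceManager which resources each NodeManager has registered; it assumes the cluster's yarn-site.xml is available on the classpath:

```java
// Sketch: querying the ResourceManager for the resources reported by each
// NodeManager, via the YarnClient API. Nothing here is cluster-specific;
// the addresses come from the YARN configuration found on the classpath.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResourcesSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Each NodeManager registers with the ResourceManager and reports its capacity.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " memory MB: " + node.getCapability().getMemory()
                    + " vcores: " + node.getCapability().getVirtualCores());
        }
        yarn.stop();
    }
}
```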
To see all the parts in motion, we can analyze the life cycle of a job submitted to YARN. The computational unit in YARN is the container, which is typically a JVM process; in a MapReduce job, a container acts either as a mapper or as a reducer. Once a job is submitted, the ResourceManager allocates an ApplicationMaster on one of the nodes in the cluster, which is then responsible for the job's execution. The ApplicationMaster negotiates with the ResourceManager for further resources and with the NodeManagers, which launch the containers needed to complete the job. Because of the data locality principle, the ApplicationMaster can request that different NodeManagers do the same work (if those NodeManagers reside on nodes holding replicas of the same data). This is useful in large clusters with a high rate of node failure, because it makes the failures transparent to the submitted jobs. The NodeManagers also interact with the ResourceManager, but only to report their available resources.
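To show what such a job looks like from the client side, here is a sketch of the classic word count submitted through the MapReduce Job API, using mapper and reducer classes shipped with Hadoop; the input and output paths are assumptions:

```java
// Sketch of submitting a word-count job: the client contacts the ResourceManager,
// which allocates an ApplicationMaster that in turn asks for mapper and reducer
// containers. The input/output paths below are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count sketch");
        job.setJarByClass(WordCountSketch.class);

        // The mapper emits (word, 1); the reducer sums the counts per word.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // assumed input dir
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not exist yet

        // waitForCompletion submits the job to YARN and polls the ApplicationMaster for progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```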
Two YARN features are worth mentioning. The first is the possibility of configuring the cluster with high availability, which, as of the 2.6 release, also transfers job state from the active ResourceManager to the passive one in case of failover; this means the cluster can operate normally during a rolling restart or an unplanned outage. The second is the presence of queues, which allow the cluster to be used by different teams at once, with certain guarantees.
After this not so brief introduction, we get to the crux of the matter: bringing up a Hadoop cluster and keeping it that way or, if you will, the infrastructure challenge. This challenge can be split into three smaller ones: installing the cluster, configuring the cluster and, last but not least, maintaining the cluster. On this front there are a few competitors whose solutions aim to solve it, such as Hortonworks, Cloudera and, in a manner of speaking, even Oracle. I have chosen the solution offered by Hortonworks, dubbed Ambari.
First, the installation of the Hadoop cluster. Ambari is designed with a master/slave architecture: the master exposes a web interface which lets the user control the cluster, while the slaves sit on all the nodes that make up the Hadoop cluster and execute the actions triggered on the master. Installing Ambari is pretty straightforward: you add the repository for your Linux distribution and then install the ambari-server package via the package manager. Once it is installed, you configure SSH key authentication from the Ambari host to all the nodes in the cluster and, from here on, your interactions are done via the Ambari web interface. If you already have a configured cluster and need to add more nodes, the interaction is the same: you add the public key on the new server (or deploy it via Puppet) and then, from the Ambari web interface, you simply add the node to the cluster and Ambari takes care of the deployment. With that, the first of the smaller challenges is resolved.
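For completeness, most of what the web interface shows is also exposed over Ambari's REST API, so the same information can be pulled programmatically; the sketch below assumes Ambari's default port 8080 and default admin credentials, and the host name is made up:

```java
// Sketch: listing the hosts Ambari knows about through its REST API.
// The server name, port 8080 and the admin:admin credentials are assumptions
// (the Ambari defaults); adjust them to your own setup.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariHostsSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ambari.example.com:8080/api/v1/hosts"); // assumed Ambari server
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // The response is a JSON document describing every registered host.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```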
For the configuration part, Ambari exposes all the configuration files present in a Hadoop cluster in its web interface, so you can edit them there. As of Ambari 2.1.2, it also presents some of the properties in a more user-friendly way, through the use of sliders. A really nice feature that reduces the integration work needed for the cluster is the configuration versioning Ambari has out of the box: any modification made to a configuration generates a new version of it, so you can roll back with a couple of clicks in case the new configuration breaks something. With that, we have cleared the second of the smaller challenges.
Last but not least is the maintenance of the cluster, which Ambari tries to simplify. It enables granular control of everything running in the cluster, so you can start and stop individual daemons, but it also gives you the possibility of starting and stopping larger components, like the YARN or HDFS subsystems. Any modification to the configuration requires a restart of the corresponding daemons; Ambari marks the ones running with stale configs and offers to restart only the affected components. Adding a new service (think Spark, Oozie, Sqoop) is done from the web interface in a straightforward manner. Starting with version 2.0, it has an integrated monitoring and alerting system called ambari-metrics, which removes the integration work that used to be needed with tools like Nagios and Graphite; the metrics can be customized and organized into dashboards in the Ambari web interface. One last perk of Ambari worth mentioning is its support for updating the running Hadoop stack in a rolling manner, making life easier for the cluster maintainer. With this, I consider sub-challenge number three done and, with it, the infrastructure challenge covered.
As a little bonus, I want to mention another great tool in the Hadoop ecosystem: Hue. Hue is an open source project which aims to simplify interaction with the cluster from a developer's or data scientist's point of view. It integrates many of the projects built on top of Hadoop into a single interface and enables developers to work faster and in a more visual manner.
The feature that seems most useful from a developer's standpoint is its integration with Oozie. Say you want to use a Hadoop cluster to run a batch job which involves multiple steps, such as getting data from an external source, doing some work on it and then writing the output to an external destination. In a plain Hadoop setup this would mean some scripts or commands run from the command line, or maybe jobs run from something like Jenkins, but with Hue you can load your scripts into HDFS and create a workflow in a drag-and-drop manner, which can be saved and run multiple times. It also gives you the possibility of scheduling jobs through its coordinator. Some other features that come in handy are file system interaction from the web interface that is much better than the one Hadoop offers by default, and job control from the web interface that also beats Hadoop's default, since it lets you stop jobs from the interface. Hue can be integrated with LDAP, so each developer who interacts with the cluster through Hue sees only their own context, meaning their own home directory in HDFS and their own jobs on YARN.
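Since the workflows Hue builds are ordinary Oozie workflows, they can also be submitted outside the web interface, for example through Oozie's Java client; in the sketch below, the Oozie URL, the HDFS application path and the user name are all assumptions:

```java
// Sketch: submitting an Oozie workflow through the Oozie Java client.
// The Oozie server URL, the HDFS path holding workflow.xml and the user
// name are illustrative assumptions, not values from any real cluster.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie"); // assumed Oozie server

        Properties conf = oozie.createConfiguration();
        // Directory on HDFS that contains the workflow.xml definition.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode.example.com:8020/user/demo/workflow");
        conf.setProperty("user.name", "demo");

        // run() submits and starts the workflow; Oozie then walks its actions on the cluster.
        String jobId = oozie.run(conf);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}
```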
The aim of this article is to give a broad view of the architectural requirements of a Hadoop cluster and to propose a solution to them, in order to encourage and ease the adoption of these new technologies. It is my personal conviction that people may be interested in the hype around Hadoop and MapReduce but are intimidated by the infrastructure requirements of running a cluster; with the help of the ecosystem that has formed around Hadoop, those requirements have become much easier to meet.