TSM - Apache Spark: make Big Data simple

Tudor Lăpușan - Java & Big Data developer @ Telenav

In the last year, Apache Spark has received a lot of attention in the Big Data and Data Science fields, mainly because it offers an easier, friendlier API and better memory management than MapReduce, so developers can concentrate on the logical operations of the data computation rather than on the details of how the computation is executed behind the scenes.

The goal of this article is to introduce the main Spark concepts and to show how easy it can be to start learning and writing Spark code.

The article covers:

- Apache Spark history

- Where can we learn more about Spark?

- Spark architecture

- Easy ways to run Spark

- Supported languages

- RDD

- Spark operations: transformations and actions

        

Apache Spark history

Apache Spark was started in 2009 at UC Berkeley AMPLab by two Romanians, Matei Zaharia and Ion Stoica, and in 2013 the project was donated to the Apache Software Foundation under the Apache 2.0 license.

Spark's goal was to provide an easier, friendlier API and better memory management than MapReduce, so that developers could concentrate on the logical operations of the computation rather than on the details of how it is executed behind the scenes.

All of this gives us the impression that we are not working with a distributed framework, makes the code more understandable and maintainable, and speeds up development time.

Where can we learn more about Spark?

There are a lot of tutorials on the Internet for learning Spark, including this one.

But if you want to invest more of your free time in understanding Spark's functionality, I would recommend reading the "Learning Spark" book.

It is an easy book to read: it takes you from zero, describing all of Spark's functionality, and contains a lot of practical examples in Java, Scala and Python.

Another good resource for learning Spark is its official documentation. Its main advantage is that it covers the latest features of Spark. For example, at the moment of writing this article, the "Learning Spark" book is based on Spark 1.3, while the latest documentation covers Spark 1.6.

If you really want to learn Spark, I recommend starting with the book and only then reading about the latest features in the official documentation. I would start with the book because it presents the information in the right order and explains it very well.

 

Spark architecture

In my opinion, this is where the true value of Spark lies.

Learning the Spark Core basics gives you the ability to address multiple use cases, such as batch processing, machine learning, SQL, stream processing and graph processing. Even better, you can combine them; for example, you can mix batch code with SQL code in the same Java class. Just imagine how many lines of plain Java code would be needed to implement the SQL statement "select * from user_table where location = 'US' order by age desc".
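
To make this concrete, here is a minimal sketch, in Java, of how batch and SQL code can live in the same class. It assumes the Spark 1.6 DataFrame API and a hypothetical users.json file with location and age fields, and it runs in local mode, so no cluster is required.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class BatchAndSqlExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-and-sql").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load a JSON file of users (hypothetical path) into a DataFrame.
        DataFrame users = sqlContext.read().json("users.json");

        // Register it as a temporary table so it can be queried with SQL.
        users.registerTempTable("user_table");

        // The SQL statement from the article, expressed in a single line.
        DataFrame usUsersByAge = sqlContext.sql(
            "SELECT * FROM user_table WHERE location = 'US' ORDER BY age DESC");

        // Mix in batch code: drop back to an RDD and continue with Spark Core operations.
        JavaRDD<Row> rows = usUsersByAge.javaRDD();
        System.out.println("US users: " + rows.count());

        sc.stop();
    }
}

The SQL part, the DataFrame part and the RDD part all share the same application and the same data, which is exactly what makes combining use cases so cheap.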

The Hadoop ecosystem is very large and contains a different framework for each particular use case: Hive for SQL, MapReduce for batch processing, Giraph for graph processing, Storm for real-time processing, Mahout for machine learning. If you want to learn or work with as many of these use cases as possible, you have to learn a new framework each time. Spark covers all of these use cases out of the box, so the effort needed to switch from batch processing (Spark Core) to real-time processing (Spark Streaming) is lower than switching from MapReduce to Storm, for example.

Easy ways to run Spark

The first thought you might have before writing any Spark code is that you need a cluster of machines and some Linux skills to set it up, which might make you give up on the idea of learning or writing Spark code altogether.

These are some of the easy ways to run your Spark code:

1. Your preferred IDE (e.g. IntelliJ IDEA or Eclipse for Java)

You can write and try your code directly from your IDE, without any Spark cluster, as shown in the local-mode sketch after this list.

2. Standalone deploy mode

If you need the functionality of a Spark cluster, you can deploy Spark on a single machine. All you need to do is download a pre-built version of Spark, unzip it and run the sbin/start-all.sh script. You can read more details here.

3. Amazon EMR

Using the EMR service, you can deploy a Spark cluster of many machines using just a web page and your mouse. Please read the Amazon documentation for a complete tutorial. The only drawback is that it will cost you some money.

4. Hadoop vendors

The biggest Hadoop vendors are Cloudera and Hortonworks. To get you started with Hadoop, they offer virtual machines that you can download and run on your own PC. This way, you get a single-node Hadoop cluster which also contains the Spark service.

5. Using Docker

You can follow this video tutorial to see how Docker and Zeppelin can help you run Spark examples.
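
Returning to option 1, the sketch below (in Java, and purely illustrative) shows the only Spark-specific detail needed to run code straight from the IDE: setting the master to local[*], which makes Spark execute inside the JVM launched by your IDE instead of on a cluster. The only prerequisite is having the spark-core dependency on the classpath.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalModeExample {
    public static void main(String[] args) {
        // "local[*]" runs Spark in this JVM, using all available cores: no cluster needed.
        SparkConf conf = new SparkConf().setAppName("local-mode").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A tiny, made-up dataset, just enough to check that everything works.
        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "makes", "big", "data", "simple"));
        System.out.println("Number of words: " + words.count());

        sc.stop();
    }
}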

Supported languages

Apache Spark provides native support for Scala, Java, Python and, most recently, for the R language, so a wide range of programmers can try their hand at Spark.

Spark is mainly written in Scala and runs inside the JVM, so it also offers good performance for Java. Python is supported as well and offers good performance through the smart use of Py4J.

 

RDD

An RDD (Resilient Distributed Dataset) is simply an immutable, distributed collection of objects.

Based on the above image, we can imagine the RDD as a collection of Character objects (in Java terms), partitioned and distributed across many machines. Partitions are the units of distribution for an RDD, just like data blocks are for HDFS.

RDDs are the mechanism through which Spark manages data well during execution, a feature that MapReduce does not offer.
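
A minimal Java sketch of creating RDDs is shown below; the Character values are just illustrative and the commented-out file path is hypothetical.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD of Characters, as in the diagram, explicitly split into 3 partitions.
        List<Character> letters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f');
        JavaRDD<Character> lettersRDD = sc.parallelize(letters, 3);
        System.out.println("Partitions: " + lettersRDD.partitions().size());

        // RDDs can also be created from external storage, e.g. a text file on HDFS or local disk:
        // JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/input.txt");

        sc.stop();
    }
}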

Spark operations: transformations and actions

Any Spark application consists of two types of operations: transformations and actions. Transformations create a new RDD from an existing one.

Below you can see the diagram describing two of the most common transformations: map and filter.

In the case of the map transformation described in the image below, Spark iterates, in a distributed way, through each element and increments its value by one, resulting in a new RDD, named MapRDD.

All transformations in Spark are lazy: they are not computed right away, but only when an action needs their result.
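
Here is a sketch of those two transformations in Java 8: a map that increments each element by one, as in the diagram, and a filter with an illustrative predicate. Nothing is actually computed until the count() call at the end, which is an action (described in the next section).

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("transformations").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // map: increment every element by one, producing a new RDD (the "MapRDD" from the diagram).
        JavaRDD<Integer> mapRDD = numbers.map(x -> x + 1);

        // filter: keep only the elements matching a predicate (an illustrative one here).
        JavaRDD<Integer> filteredRDD = mapRDD.filter(x -> x > 3);

        // Nothing has run so far; transformations only build up the lineage of RDDs.
        // The action below is what triggers the actual distributed computation.
        System.out.println("Elements after map and filter: " + filteredRDD.count());

        sc.stop();
    }
}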

  

An action computes a result from an RDD.

An example of an action is count(), which returns the number of elements in an RDD.

The take(n) action returns the first n elements of the RDD, and saveAsTextFile() persists the RDD's elements to an external data source, such as HDFS.
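
Putting these actions together, here is a short Java sketch. The input values and the output path are hypothetical, and the output directory must not already exist when the job runs.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("actions").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.parallelize(
            Arrays.asList("spark", "makes", "big", "data", "simple"));

        // count(): returns the number of elements in the RDD.
        long total = lines.count();

        // take(n): brings the first n elements back to the driver as a local list.
        List<String> firstTwo = lines.take(2);

        // saveAsTextFile(): writes the RDD to external storage (hypothetical local path here).
        lines.saveAsTextFile("/tmp/spark-actions-output");

        System.out.println("count = " + total + ", take(2) = " + firstTwo);
        sc.stop();
    }
}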

Enough theory!

During the presentation at the TSM event, I will also show you some Spark code which includes all the examples described in this article.

If you want to learn more about Spark or other Big Data frameworks, I invite you to join the BigData/DataScience community from Cluj.