TSM - Scalable Data Science with Spark ML Pipelines

Andrei Bâcu - Senior developer @ Paddy Power Betfair

What is Data Science?

The term Data Science is not really that new, and yet it is still raising much debate among scholars and practitioners about what it is, and what it is not. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modeling, statistics, analytics, and mathematics. Two great books that will help you start your Data Science journey are Developing Analytic Talent: Becoming a Data Scientist, by Vincent Granville, and The Field Guide to Data Science, released by the Booz Allen Hamilton company.

In case you missed it, Data Scientist has also been called the sexiest job of the 21st century. The role has become a distinct one in industry, government, and other data-driven organizations. More than anything, what data scientists do is make discoveries while swimming in data. A data scientist is also a generalist in business analysis, statistics, and computer science, with expertise in fields such as robustness, algorithm complexity, data visualization, and dashboards. Visually, this is represented with the popular Data Science Venn Diagram in Figure 1 below.

Figure 1. Data Science Venn Diagram

Spark to the rescue!

In what follows, let us see how Apache Spark entered the realm of Data Science, and made it scalable and user-friendly. As you may have read in previous issues of the TSM magazine, Spark is a fast and general engine for large-scale data processing that was developed at U.C. Berkeley by two Romanians: Matei Zaharia and Ion Stoica.

Currently, it is the most active Apache project, encompassing multiple libraries, as can be seen in Figure 2 below. The latest stable release, at the time of writing, is Spark 1.6.1.

Recently, Spark ML Pipelines, one of the top incubator efforts, was released as an extension of earlier work on Spark MLlib. This new Spark ML package aims at providing a uniform set of high-level APIs for machine learning algorithms. It is built to enable users to create and tune typical machine learning pipelines. In the latest stable version of Spark, the algorithm coverage includes: Classification, Regression, Collaborative Filtering, Clustering, Dimensionality Reduction, Feature extraction & selection, Model import/export, Pipelines, and Evaluation Metrics.

Figure 2. Spark Libraries

Spark ML Pipeline components

DataFrame

Spark ML uses DataFrame from Spark SQL as an ML dataset, which is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood. This structure can be built from a variety of data sources: data files, Hive tables, external databases, or existing RDDs. For example, a DataFrame in an ML Pipeline could have different columns storing text, feature vectors, true labels, and predictions.
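As a minimal sketch, assuming a local Spark 1.6 setup with a SQLContext, a small DataFrame of labeled text documents can be built from an in-memory collection; the label and text column names are just illustrative choices:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Spark 1.6-style setup: a SparkContext plus a SQLContext
    val conf = new SparkConf().setAppName("DataFrameExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a small DataFrame from a local collection of (label, text) pairs
    val training = sc.parallelize(Seq(
      (1.0, "spark makes big data simple"),
      (0.0, "hadoop map reduce")
    )).toDF("label", "text")

    training.printSchema()
    training.show()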

Transformers

A Transformer represents an abstraction that covers both feature transformers and learned models. Programmatically speaking, it implements the transform() method, which converts one DataFrame into another, typically by appending one or more columns. For example, an ML learning model might take a DataFrame containing features, predict the label, and output a new DataFrame with the predicted labels appended as a new column.
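For instance, the Tokenizer feature transformer can be applied to the training DataFrame built in the previous sketch; the words output column name is just an illustrative choice:

    import org.apache.spark.ml.feature.Tokenizer

    // A Tokenizer is a simple Transformer: transform() appends a new column
    // ("words") containing the tokenized text, leaving the input columns intact
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    val tokenized = tokenizer.transform(training)
    tokenized.select("text", "words").show(false)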

Estimators

An Estimator represents an abstraction of a Machine Learning algorithm that can be trained on input data. Technically, an Estimator implements the fit() method, which accepts a DataFrame and produces a Model, which is also a Transformer. For example, a learning algorithm such as Linear Regression is an Estimator, and calling fit() trains a LinearRegressionModel, which is a Model and hence a Transformer.
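As a minimal sketch of this contract, the snippet below fits a LinearRegression on a toy DataFrame of hand-crafted feature vectors (the values and parameters are made up for illustration) and then uses the resulting model as a Transformer:

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.linalg.Vectors

    // A toy DataFrame with a numeric "label" and a "features" vector column
    val points = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1)),
      (0.0, Vectors.dense(2.0, 1.0)),
      (1.5, Vectors.dense(1.0, 2.3))
    )).toDF("label", "features")

    // LinearRegression is an Estimator: fit() returns a LinearRegressionModel,
    // which is itself a Transformer and can be applied to new DataFrames
    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val lrModel = lr.fit(points)
    lrModel.transform(points).select("features", "label", "prediction").show()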

Pipeline

A Pipeline is specified as a sequence of stages, each of which is either a Transformer or an Estimator. The stages run in the order they are declared, and the input DataFrame is transformed as it passes through each of them. An example of a machine learning process is a simple text processing workflow that includes several stages:

  1. Split each document's text into words.

  2. Convert the words into numerical feature vectors.

  3. Learn a prediction model using the feature vectors and labels.

This ML Pipeline with three stages is illustrated in Figure 3 below. The components Tokenizer and HashingTF are Transformers, whereas LogisticRegression is an Estimator.

Figure 3. Pipeline used for text processing

Note that a Pipeline is itself an Estimator. Thus, a Pipeline’s fit() method produces a PipelineModel, which is a Transformer. This PipelineModel can then be used at test time, as can be seen in Figure 4 below.

Figure 4. PipelineModel used for obtaining predictions
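Putting these pieces together, here is a minimal sketch of the text processing Pipeline from Figures 3 and 4, reusing the training DataFrame from the first sketch above; the number of hashed features and the regularization parameter are arbitrary example values:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Assemble the three stages from Figure 3 into a single Pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting the Pipeline (an Estimator) yields a PipelineModel (a Transformer)
    val model = pipeline.fit(training)

    // At test time, the PipelineModel runs all stages and appends a "prediction" column
    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark streaming and sql"),
      (5L, "classic map reduce jobs")
    )).toDF("id", "text")

    model.transform(test).select("id", "text", "prediction").show()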

Typically, ML Pipelines are complex and involve multiple interrelated phases, as illustrated in Figure 5 below:

  1. Data ingestion. Spark RDDs or DataFrames can be used to ingest input data.

  2. Feature extraction. By leveraging Spark ML Pipelines, a Transformer can be used to convert DataFrames into other DataFrames containing the needed features.

  3. Model training. For example, an Estimator can be trained on a DataFrame with features, and produce one or more ML models.

  4. Model evaluation. Predictions based on ML models can be obtained in Spark ML Pipelines using Transformers, as shown in the sketch after this list.

Figure 5. ML Workflow complexity
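As a rough sketch of the last two phases, the PipelineModel fitted in the earlier sketch can be applied to a small labeled validation set and scored with a BinaryClassificationEvaluator; the validation data below is purely illustrative:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Evaluate predictions produced by the PipelineModel on a labeled hold-out set
    val validation = sqlContext.createDataFrame(Seq(
      (1.0, "spark ml pipelines"),
      (0.0, "plain map reduce")
    )).toDF("label", "text")

    val predictions = model.transform(validation)

    // Area under the ROC curve, computed from the "rawPrediction" column
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")

    println(s"AUC = ${evaluator.evaluate(predictions)}")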

What the future holds

Looking ahead, future development of Spark MLlib will include an API for SparkR, improved model inspection, and Pipeline optimizations. Until a new stable release is made available, the Apache Spark team has posted a preview release of Spark 2.0, meant to give the community early access to the new version's code. This preview includes a cool new feature called Machine Learning Model Persistence, the ability to save and load ML models across all programming languages supported by Spark. Moreover, the DataFrame-based Machine Learning API will emerge as the primary ML API, with more support for popular algorithms such as Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means.
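Based on the Spark 2.0 preview API, saving and reloading a fitted PipelineModel could look roughly like the sketch below; the output path is only an example:

    // Spark 2.0 preview: ML persistence lets a fitted PipelineModel be written
    // to disk and reloaded later, even from a different language binding
    import org.apache.spark.ml.PipelineModel

    model.write.overwrite().save("/tmp/spam-model")   // example path, not prescriptive
    val reloaded = PipelineModel.load("/tmp/spam-model")

    reloaded.transform(test).select("id", "prediction").show()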