TSM - Review of “Haskell Data Analysis Cookbook”

Radu Marius Florin - Business & Data Analyst

I have experimented with various exercises in Haskell out of curiosity and for learning purposes. I consider myself an old-school statistician, an R programmer and, occasionally, a Python one. I am interested in statistics and data analysis, and in concepts and new paradigms such as NoSQL, Big Data, MapReduce and functional programming.

Note to readers: this is not an introductory book on Haskell or functional programming. The author assumes that the reader is already familiar with Haskell's syntax and type system, which differ significantly from those of other programming languages. Functional programming concepts such as monads and purity are used frequently throughout the book.

The book is presented as a beautiful collection of well-organized examples. The reader will find recipes accompanied by brief sections of code designed, in general, to solve data analysis and processing problems, although many parts and concepts are addressed to experienced programmers. The book covers a very wide variety of topics in programming and data analysis, reviewing a broad set of concepts and techniques that a "complete" analyst should master. However, the book does not manage to cover all of these topics in depth, and many of them are treated in the spirit of: Haskell can do this, but this is just an introduction.

What can we achieve with Haskell?

In terms of interactivity with the data (the read-evaluate-print loop), Haskell's GHCi is comparable to IPython or Clojure's REPL, but it is quite far from what is actually available in RStudio or Matlab. During exploratory analysis, interactivity is highly important. While working on the book's exercises, I realized that investigating data in Haskell is not so fast and easy, and that working with data structures is not straightforward. In general, a data analyst wants fast interactivity with the data, a swift investigation of its structure, and graph generation without too much programming effort. For example, Clojure - another functional language with a critical mass of users - has Incanter, a platform for statistical analysis and graphics. As far as I know, there is nothing similar in Haskell so far. The fact that Haskell does not offer this out of the box makes it a second option for a data analyst or statistician. Obviously, these conclusions are strongly influenced by my experience with R, Python and Matlab.

On another train of thought, when I put on my statistical programmer's hat - that of a programmer who develops data-driven software - I start to appreciate the Haskell programming language more and more. Haskell's parallelism and concurrency features make it highly attractive to a programmer, even one who is only building a prototype or a small application that uses data sources of considerable size, spread, or structural complexity (e.g. NoSQL, Big Data). Haskell is a purely functional, lazy and statically typed programming language.

The book succeeds in illustrating all these aspects of the Haskell programming language very well. Many explanations are offered in the context of concrete problems and examples, showing why these peculiarities of the language matter. For a developer who wants to build data analysis software in Haskell, the presented examples and models can be very useful.

I noticed some slight differences between the code in the book and the code downloaded from GitHub, probably because the GitHub files are updated frequently. I used the code already written on GitHub and made small changes and adjustments to play with the exercise files. Overall, the exercises and the corresponding code worked well. I had minor problems installing some Haskell libraries, but this relates more to my experience while using the book than to the content of the book itself.

Let's see the facts

Going through the examples, I felt a slight annoyance when I realized that I did not have an equivalent of the "data.frame" from R or Python. In Python this data structure is available through the "pandas" library, while in R it is a primary structure: a two-dimensional table in which each column contains the values of one variable. This data structure makes it easier to process categorical and nominal variables and offers a more intuitive way of interacting with data sources during analysis. Considering the proportion of mathematicians and researchers in the Haskell community and the pace of the language's development, we will probably see libraries that address this issue very soon.
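To illustrate what I missed, here is a minimal sketch of a naive, column-oriented table in plain Haskell; the Frame type, its fields and the data are my own invention, not something from the book or from an existing library.

-- A naive, column-oriented "data frame": each field is one column.
-- Type, field names and data are hypothetical, for illustration only.
data Frame = Frame
  { names  :: [String]   -- a text column
  , ages   :: [Int]      -- a numeric column
  , groups :: [String]   -- a categorical column
  } deriving Show

people :: Frame
people = Frame
  { names  = ["Ana", "Ion", "Maria"]
  , ages   = [34, 28, 41]
  , groups = ["A", "B", "A"]
  }

-- Even a simple, data.frame-like operation needs explicit plumbing:
meanAge :: Frame -> Double
meanAge f = fromIntegral (sum (ages f)) / fromIntegral (length (ages f))

main :: IO ()
main = print (meanAge people)

Nothing here is hard, but every convenience that data.frame or pandas gives for free has to be written by hand.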

In the following paragraphs I will describe the aspects I enjoyed the most, as well as the shortcomings I perceived in the chapters and topics I encountered while reading the book.

The good and the bad parts

The book begins with a chapter on data input - I/O operations. Compared to other programming languages, purity is one of Haskell's main strengths, and I/O actions are a kind of "Achilles' heel" when it comes to challenging that purity. The author avoids getting stuck in a theoretical discourse around this concept. Nishant Shukla begins the book in a pragmatic way, with a subject that is very important for any analyst or programmer: data input - how to read or import data into the Haskell environment. The chapter presents examples of reading data in various formats: CSV, JSON or XML. Furthermore, the author provides examples of gathering data from APIs, from web pages (parsing), or from NoSQL databases (e.g. MongoDB). The recipes are quite useful and contain a lot of "how-to" code for data processing.
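As a flavour of this kind of recipe, here is a minimal sketch of reading a CSV file in plain Haskell; the file name and the naive comma splitting are my own assumptions - the book's recipes rely on proper CSV, JSON and XML parsing packages.

-- Read a CSV file and split each line on commas (no quoting support).
-- "input.csv" is a hypothetical file name used only for this sketch.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn sep rest

main :: IO ()
main = do
  contents <- readFile "input.csv"
  let rows = map (splitOn ',') (lines contents)
  mapM_ print rows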

Chapter 2, generically called "Integrity and Inspection", treats several distinct categories of data analysis topics. The first part presents recipes for data cleansing (trimming, parsing). The second part introduces recipes for data aggregation, summarization and reporting (frequency tables), and the third covers the concept of similarity (distances, correlations). All three of these topics play a central role in data analysis, so it would have been better to give them more weight, perhaps even to split them into separate, more detailed chapters. Cleaning and aggregating data probably takes up more than 80% of a data analyst's time. The examples show how specific problems are solved using only base Haskell, without specialized libraries. For a developer building a software application this approach is very good, and in those cases the examples and code snippets are very helpful. An analyst, however, generally wants to invest less effort in data scrubbing and preparation, so specialized libraries for these purposes would be more appreciated.
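For instance, a frequency table of the kind discussed in this chapter can be written in a few lines with the containers package; this is my own minimal sketch, with an invented input list, not the book's recipe.

-- Count how often each value occurs, using Data.Map from containers.
import qualified Data.Map as Map

frequencies :: Ord a => [a] -> Map.Map a Int
frequencies xs = Map.fromListWith (+) [(x, 1) | x <- xs]

main :: IO ()
main = print (frequencies ["yes", "no", "yes", "yes", "n/a"])
-- fromList [("n/a",1),("no",1),("yes",3)]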

In most cases an analyst is interested mainly in the information and value in the data, and I would say he or she is less focused on code quality, performance or elegance. While reading the book, I would have appreciated examples of how to generate a table of column percentages in Haskell, or how to obtain averages by category. The examples are simple and brief, and the code is very nicely written and commented; still, when it comes to generating a data report, it would have been helpful to see how to quickly run basic statistical tests, at least a t-test or some non-parametric tests.
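As an illustration of the kind of example I missed, here is a minimal sketch of computing averages by category; the groupAverages helper and the data are my own invention, assuming only the containers package.

-- Average the values of each category, using Data.Map from containers.
import qualified Data.Map as Map

groupAverages :: Ord k => [(k, Double)] -> Map.Map k Double
groupAverages xs = Map.map (\(s, n) -> s / n) sums
  where
    sums = Map.fromListWith add [(k, (v, 1)) | (k, v) <- xs]
    add (s1, n1) (s2, n2) = (s1 + s2, n1 + n2)

main :: IO ()
main = print (groupAverages [("A", 10), ("B", 4), ("A", 20), ("B", 6)])
-- fromList [("A",15.0),("B",5.0)]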

Further on in the chapter, the concepts of similarity, distance and correlation are briefly presented. Haskell's syntax lets the reader see the code for statistical formulas in a form very close to their mathematical expressions. I like this aspect very much; it is emphasized in the book's examples and makes the code clearer and easier to read.
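To show what I mean, here is a minimal sketch of the Pearson correlation coefficient written to stay close to the mathematical formula; it is my own illustration, not code taken from the book.

-- Pearson correlation, close to the formula: the 1/n factors cancel,
-- so raw sums of centred products and squares are enough.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

pearson :: [Double] -> [Double] -> Double
pearson xs ys = covXY / sqrt (varX * varY)
  where
    mx    = mean xs
    my    = mean ys
    covXY = sum [ (x - mx) * (y - my) | (x, y) <- zip xs ys ]
    varX  = sum [ (x - mx) ^ 2 | x <- xs ]
    varY  = sum [ (y - my) ^ 2 | y <- ys ]

main :: IO ()
main = print (pearson [1, 2, 3, 4] [2, 4, 6, 8])  -- 1.0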

After Chapter 2, I jumped directly to Chapters 7 and 8 - "Statistics and Analysis" and "Clustering", respectively - which are more relevant for my work as a data analyst and statistician. Here I discovered some very interesting things, such as cluster analysis using lexemes, building text n-grams in just a few lines of code, or approximating a quadratic regression. For these two subjects, too, the author does not go deeply into the topics, and the exercises presented are not very close to real business or practical cases. It is obviously quite hard to achieve this across such a broad spectrum of topics and concepts. I think it would have been more appropriate and relevant to use larger data sets, with more than 3-5 records. I personally see more added value for the "Hello World!"-like examples in these chapters when they use classic datasets such as "Iris" or "German credit". These classic data sets are frequently used in books and tutorials for other programming languages; in general, they are useful for benchmarking data processing or for illustrating multivariate analysis exercises.
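The n-grams recipe is a good example of how short such code can be; the few-line function below is my own sketch in the same spirit, not the book's exact code.

-- Build word n-grams: every window of n consecutive words.
import Data.List (tails)

ngrams :: Int -> [a] -> [[a]]
ngrams n xs = [ take n t | t <- tails xs, length t >= n ]

main :: IO ()
main = mapM_ print (ngrams 2 (words "to be or not to be"))
-- ["to","be"], ["be","or"], ["or","not"], ["not","to"], ["to","be"]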

Chapter 9 approaches the concepts of parallel and concurrent programming. The explanations and the examples presented are straightforward. The author does not go into much detail, but for me, presenting the concepts together with code examples was very useful.
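For readers who have not seen Haskell's approach before, here is a minimal sketch of data parallelism using the parallel package's Control.Parallel.Strategies; the workload function is invented for illustration and is not one of the book's recipes.

-- Evaluate the expensive function over a list in parallel with parMap.
-- Build with: ghc -threaded -O2, run with: +RTS -N
import Control.Parallel.Strategies (parMap, rdeepseq)

expensive :: Int -> Int
expensive n = sum [1 .. n]  -- stand-in for a costly, pure computation

main :: IO ()
main = print (sum (parMap rdeepseq expensive [100000, 200000, 300000, 400000]))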

Similarly to the previous chapter, in Chapter 10 - "Real Time Data" - the author presents less data analysis content and more software engineering cases, such as acquiring streamed data from Twitter or IRC channels and communicating in real time over sockets.

Chapter 3 - "The Science of Words" - covers algorithms for data manipulation and conversion, while Chapter 4 - "Data Hashing" - presents various hashing functions. Not only in these chapters, but throughout the book's recipes, the author uses do-notation, which makes the examples more readable. Do-notation is syntactic sugar provided by Haskell, a convenient way to write code, but not a way to write imperative code, as one might misinterpret it. When the compiler encounters a do block, it translates it into bind operators and lambda expressions.
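As a small illustration of that translation, the two functions below do the same thing; this is my own sketch of the idea, not an example from the book.

-- A do block and its hand-desugared equivalent using (>>=) and lambdas.
greetDo :: IO ()
greetDo = do
  putStrLn "What is your name?"
  name <- getLine
  putStrLn ("Hello, " ++ name)

greetBind :: IO ()
greetBind =
  putStrLn "What is your name?" >>= \_ ->
  getLine                       >>= \name ->
  putStrLn ("Hello, " ++ name)

main :: IO ()
main = greetDo >> greetBind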

Chapter 5 "The Dance with Trees" and 6 "Graph Fundamentals" were the most difficult but also the most interesting chapters. Chapter 6, regarding graphs, is a natural extension as the author calls it, of the previous chapter regarding the trees. The author Nishant Shukla exemplifies in these two how to use an exceptional and unique software library that is at programmers’ disposal in Haskell - Lens library.

In general, manipulating data in nested immutable data structures is always difficult, but in Haskell the problem is more obvious, since almost everything is immutable. The Lens library offers an elegant way to access and manipulate data in complex nested structures.
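Here is a minimal sketch of the idea using Control.Lens; the Person and Address types, their fields and the data are my own invention, used only to show the style of access and update.

-- Read and "update" a nested, immutable record with lenses.
{-# LANGUAGE TemplateHaskell #-}
import Control.Lens

data Address = Address { _city :: String, _street :: String } deriving Show
data Person  = Person  { _name :: String, _address :: Address } deriving Show
makeLenses ''Address
makeLenses ''Person

main :: IO ()
main = do
  let p = Person "Ana" (Address "Cluj" "Memorandumului")
  print (p ^. address . city)                 -- read a nested field
  print (p & address . city .~ "Bucuresti")   -- return an updated copy

The update does not mutate anything; it returns a new value that shares the unchanged parts with the old one, which is exactly why such a library is so welcome in Haskell.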

The last two chapters are focused on another main topic of interest for data analysts: visualization and presentation of data.

In Chapter 12 - "Exporting and Presenting" - the author presents a recipe for building a LaTeX table, which reminded me that Haskell is a programming language particularly favoured by researchers. LaTeX is the standard format for communicating and publishing scientific papers.
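In the same spirit, a LaTeX tabular can be generated with a few lines of plain Haskell; the toLatexTable helper and the data below are my own sketch, not the book's recipe.

-- Render rows of strings as a LaTeX tabular environment.
import Data.List (intercalate)

toLatexRow :: [String] -> String
toLatexRow cells = intercalate " & " cells ++ " \\\\"

toLatexTable :: [[String]] -> String
toLatexTable rows = unlines $
     ["\\begin{tabular}{" ++ replicate cols 'l' ++ "}"]
  ++ map toLatexRow rows
  ++ ["\\end{tabular}"]
  where cols = case rows of { (r : _) -> length r; [] -> 0 }

main :: IO ()
main = putStrLn (toLatexTable [["Name", "Score"], ["Ana", "7.5"], ["Ion", "8.2"]])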

Chapter 11 - "Visualizing Data" - covers recipes for producing graphical representations of data using the Google Charts API, gnuplot or the JavaScript library D3.js. I was impressed by how easily other tools and frameworks can be used alongside Haskell. I was also pleasantly surprised by the Diagrams library: a beautifully designed, declarative drawing library that can be used to create composable drawings. In the last recipe of this chapter, the author describes how to mark a route on a map using this library.
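To give a taste of its declarative style, here is the classic minimal Diagrams example, assuming the diagrams-lib and diagrams-svg packages; it is a generic hello-world sketch, not the map recipe from the book.

-- Draw a filled circle; the SVG backend supplies the command-line options.
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine

myCircle :: Diagram B
myCircle = circle 1 # fc blue # lw none

-- Run with, e.g.: ./drawing -o circle.svg -w 400
main :: IO ()
main = mainWith myCircle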

All in all, I liked the book very much and I recommend it to all programmers interested in an intellectual challenge. The author has a very broad range of knowledge, managing to approach a wide array of problems from fields such as statistics, data analysis, programming, software engineering and linguistics. I found many practical examples in the book, and I also recommend it to data analysts passionate about functional programming. I am eager to see new books from Nishant Shukla, and I will follow with interest Packt Publishing's activity in the "Big Data & Business Intelligence" section of its website.