TSM - Java 8 – collections processing

Ovidiu Simionica - Team Lead

Java 8 is beautiful. Yes, I class it as feminine even before reaching the magic number of 8. In this article I will dare to analyze to what extent this beauty is shape versus substance. I admit that Java has constantly left its fans dissatisfied when competing with other programming languages.

Lambda expressions. Yes, C# has had lambda expressions since version 3.0, which was launched in 2007. Java needed 7 more years to offer them. Functional programming came really late.

Generics. An illusion of template metaprogramming that fell flat because of type erasure… Am I the only one who expected something else?

I feel like a Nokia fan who dreams of high resolution and Android. So you can imagine me on release day, like any other Java guy, repeatedly pressing the refresh button on the Oracle site while waiting for the download of the latest version.

With Java 8 on the surgeon's table for some time at the date of this article, I ended up sharing my impressions of data collection processing, an issue that has been worrying me throughout my career.

What is data collection processing (also known as filtering)?

For those familiar with SQL, filtering is a basic operation on a data set that looks like:

SELECT * FROM Cars
WHERE Cars.manufacturer = 'VW';

What can we do if the collection is already stored in the program's memory and we want to process it according to the criteria from the previous example?

Until Java 8, the developer had to write:

ArrayList<Car> filteredCars =
  new ArrayList<Car>();

for (Car c : allCars) {
  if ("VW".equals(
      c.getManufacturer())) {
    filteredCars.add(c);
  }
}

Those at the architect level could also write:

interface Predicate<T> {
  boolean apply(T t);
}

class Filter {

  public static <T> Collection<T> filter(
      final Collection<T> items,
      Predicate<T> pred) {

    Collection<T> result =
      new ArrayList<>();

    for (T item : items) {
      if (pred.apply(item)) {
        result.add(item);
      }
    }

    return result;
  }
}

Before Java 8, calling it required an anonymous class:

Filter.filter(allCars, new Predicate<Car>() {
  @Override
  public boolean apply(Car c) {
    return "VW".equals(c.getManufacturer());
  }
});

In Java 8 the whole anonymous class shrinks to a lambda expression, and this is how Java 8 offers "high resolution".
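Because the Predicate interface above declares a single abstract method, Java 8 accepts a lambda wherever a Predicate is expected. The following is only a minimal sketch: the Car class, the Filter.filter helper name and the LambdaFilterDemo class are assumptions made for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical domain class, assumed for illustration.
class Car {
    private final String manufacturer;
    Car(String manufacturer) { this.manufacturer = manufacturer; }
    String getManufacturer() { return manufacturer; }
}

// One abstract method, so a lambda can implement it.
interface Predicate<T> {
    boolean apply(T t);
}

class Filter {
    // Returns the items for which the predicate holds, in iteration order.
    static <T> Collection<T> filter(Collection<T> items, Predicate<T> pred) {
        Collection<T> result = new ArrayList<>();
        for (T item : items) {
            if (pred.apply(item)) {
                result.add(item);
            }
        }
        return result;
    }
}

class LambdaFilterDemo {
    static Collection<Car> onlyVw(Collection<Car> allCars) {
        // The lambda stands in for the whole anonymous Predicate<Car> class.
        return Filter.filter(allCars, c -> "VW".equals(c.getManufacturer()));
    }

    public static void main(String[] args) {
        List<Car> allCars = Arrays.asList(
            new Car("VW"), new Car("Dacia"), new Car("VW"));
        System.out.println(onlyVw(allCars).size()); // prints 2
    }
}
```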

Since we live in an era of "Big Data", "real-world" applications must be scalable and effective, so it is no longer enough to process millions of records sequentially (if this has ever been enough).

Parallel processing and optimal use of processor cores are now "everyday business".

Java 8 comes to the rescue, enabling us to write:

Collection<Car> filtered =
  allCars.stream().parallel().filter(c ->
    "VW".equals(c.getManufacturer())).
      collect(Collectors.toList());

With this one simple call to the "parallel()" method, I make sure the stream library will perform its magic and divide the stream into small pieces that get processed in parallel.

Job done. Or not?

I like to read Javadocs, a habit that I consider worth cultivating since it can save us from many disasters. Over time I have also developed a sort of "sense of danger" for big words like "parallel", particularly when the result is promised as if by magic.

What execution order and threads does this method use? Does it use enough? How does it decide how many to use and what happens if the "parallel()" method is also called in parallel?

According to the documentation of ForkJoinTask.fork(): "Arranges to asynchronously execute this task in the pool the current task is running in, if applicable, or using the ForkJoinPool.commonPool() if not inForkJoinPool()."

Consequently, we use only one pool by default, no matter how many times, or from how many threads, the "parallel()" method is called across the entire application, since the stream library is built on the ForkJoin library.
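The shared pool can be inspected directly. The sketch below is only illustrative (the class name CommonPoolInfo is an assumption); ForkJoinPool.commonPool() and ForkJoinPool.getCommonPoolParallelism() are standard java.util.concurrent calls.

```java
import java.util.concurrent.ForkJoinPool;

class CommonPoolInfo {
    // Every caller in the JVM receives the same pool instance.
    static boolean isSingleSharedInstance() {
        return ForkJoinPool.commonPool() == ForkJoinPool.commonPool();
    }

    public static void main(String[] args) {
        System.out.println("cores: "
            + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: "
            + ForkJoinPool.getCommonPoolParallelism());
        System.out.println("single shared instance: "
            + isSingleSharedInstance());
    }
}
```

The target parallelism of the common pool can be overridden at JVM start through the system property java.util.concurrent.ForkJoinPool.common.parallelism, but that setting is again global for the whole application.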

Furthermore, the thread that submits the parallel processing job is itself used as a worker. The threads of the pool are thus mixed with a thread that has another purpose. If exactly that thread happens to pick up a piece of work that takes longer than expected, we run the risk of stalling the processing in the pool simply because of how Fork/Join is designed: the calling thread acts as a worker, and the other threads cannot contribute their results while they wait for the parent thread to finish its job. We have a problem!
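A quick way to see which threads actually take part is to record the name of the thread that handles each element. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the exact set of threads depends on the machine and on scheduling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

class WorkerThreadsDemo {
    // Returns the names of all threads that processed at least one element.
    static Set<String> threadsUsed(int elements) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, elements)
                 .parallel()
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        // Typically shows the calling thread ("main") next to
        // ForkJoinPool.commonPool-worker-N threads.
        System.out.println(threadsUsed(10_000));
    }
}
```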

Here we are dealing with the "paraquential" phenomenon: "[a portmanteau word derived by combining parallel with sequential] The illusion of parallelization. Processing starts out in parallel mode by engaging multiple threads but quickly reverts to sequential computing by restricting further thread commitment." (Edward Harned, 2014).

The solution proposed by Oracle is the explicit use of a ForkJoinPool controlled by the developer, as follows:

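// The pool is created and owned by the developer.
ForkJoinPool forkJoinPool = new ForkJoinPool();

Collection<Car> filtered =
  forkJoinPool.submit(() ->
    allCars.stream().parallel().filter(c ->
      "VW".equals(c.getManufacturer())).
        collect(Collectors.toList())).get();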

Thus, all the tasks generated by the parallel processing remain in the specified pool.

A positive side effect is that we can apply a timeout to the get() method, which is usually what a real-world application wants.
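The timeout variant could look like the sketch below. It is not a definitive implementation: the Car class, the TimedParallelFilter name and the timeout value are assumptions made for illustration. Note that the timeout only bounds how long the caller waits; it does not by itself cancel work already running in the pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;

// Hypothetical domain class, assumed for illustration.
class Car {
    private final String manufacturer;
    Car(String manufacturer) { this.manufacturer = manufacturer; }
    String getManufacturer() { return manufacturer; }
}

class TimedParallelFilter {
    static List<Car> filterVw(List<Car> allCars, long timeoutSeconds) {
        ForkJoinPool pool = new ForkJoinPool(); // developer-owned pool
        try {
            Callable<List<Car>> task = () -> allCars.stream()
                .parallel()
                .filter(c -> "VW".equals(c.getManufacturer()))
                .collect(Collectors.toList());
            // Throws TimeoutException if the result is not ready in time.
            return pool.submit(task).get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("parallel filter failed", e);
        } finally {
            pool.shutdown(); // the pool's lifecycle is our responsibility
        }
    }

    public static void main(String[] args) {
        List<Car> allCars = Arrays.asList(
            new Car("VW"), new Car("Dacia"), new Car("VW"));
        System.out.println(filterVw(allCars, 5).size()); // prints 2
    }
}
```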

This, in turn, takes us back to the fundamental problem: the management of pools is once again the developer's responsibility! And the difficulties keep piling up when other actors come into play, like the complex situations caused by a multi-threaded environment and the careful tuning of the configuration to the hardware architecture (e.g. the processor). Look how Java gives us half measures again, when we were expecting a self-managed (or at least easily managed) thread container.

While, for many of us, the default behavior of the streams framework is and will be more than enough, I will keep the parallel method on my "dangerous code" list when programming and reviewing code.

In this brief analysis I only touched the tip of the iceberg. I invite those of you who are curious to learn about other pitfalls of the ForkJoin library to follow with interest a dude with over 30 years of experience, who delves into these issues quite well in his post "A Java Parallel Calamity".