TSM - Interview with Peter Lawrey, special guest at Cluj IT Days 2014

Ovidiu Mățan - Founder @ Today Software Magazine

Peter Lawrey is a Java expert who ranked third in the Stack Overflow chart. We have the pleasure of having him as a special guest at Cluj IT Days 2014, where he will deliver a technical presentation, “Hot topics in Java Performance Tuning”, as well as a presentation on his experience as an independent consultant, “What I have learnt by becoming an independent consultant”. Peter is also holding a one-day workshop on Java Performance (you can find more details at www.itdays.ro). I therefore took the opportunity to ask Peter a few technical questions before the event.

[Ovidiu Mățan] How do you see Java 8 from a performance perspective?

[Peter] Perhaps the most significant performance improvement was the extension of compressed oops to 64 GB memory sizes. If you have a heap of between 32 GB and 64 GB, you can get improved memory efficiency. You can use compressed oops for 128 GB heaps, but it is unlikely to help at that point.
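The heap sizes quoted above follow from simple arithmetic: a compressed reference is 32 bits wide and is scaled by the object alignment, so it can address 2^32 × alignment bytes. The sketch below only works through that calculation. The class and method names are hypothetical, and it assumes the alignment is the one set by the HotSpot option -XX:ObjectAlignmentInBytes (8 by default).

```java
// Hypothetical helper: works through the arithmetic behind the compressed oops limits.
final class CompressedOopsLimits {

    /** Largest heap, in bytes, that a 32-bit reference scaled by the alignment can address. */
    static long maxAddressableBytes(int objectAlignmentInBytes) {
        return (1L << 32) * objectAlignmentInBytes;
    }

    /** Converts bytes to whole gigabytes (2^30 bytes). */
    static long toGigabytes(long bytes) {
        return bytes >> 30;
    }

    public static void main(String[] args) {
        // Alignments of 8, 16 and 32 bytes correspond to the 32, 64 and 128 GB figures.
        for (int alignment : new int[] {8, 16, 32}) {
            System.out.println("alignment " + alignment + " bytes -> up to "
                    + toGigabytes(maxAddressableBytes(alignment)) + " GB");
        }
    }
}
```

A larger alignment also means more padding per object, which is one plausible reason the gain fades as the alignment grows.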

Often more important than CPU performance is developer performance. In this regard, adding lambdas and the stream API could make a big difference if you can use them effectively. A developer usually costs more in a year than an expensive server, so saving development time is important, often more important than saving CPU.
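As a small illustration of the developer-time argument, the sketch below writes the same filter-and-transform step twice, once as a pre-Java 8 loop and once with a lambda and the stream API. The class, method names and sample data are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StreamExample {

    // Pre-Java 8 style: explicit loop and a mutable result list.
    static List<String> longNamesLoop(List<String> names, int minLength) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (name.length() >= minLength) {
                result.add(name.toUpperCase());
            }
        }
        return result;
    }

    // Java 8 style: the same steps expressed as a stream pipeline.
    static List<String> longNamesStream(List<String> names, int minLength) {
        return names.stream()
                .filter(name -> name.length() >= minLength)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ana", "Peter", "Ovidiu");
        System.out.println(longNamesLoop(names, 5));   // [PETER, OVIDIU]
        System.out.println(longNamesStream(names, 5)); // [PETER, OVIDIU]
    }
}
```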

[Ovidiu Mățan] Are there any classes that behave better, from a memory usage point of view, than in Java 7 or 6?

[Peter] JSR 310 adds an improved date and time library (java.time). It is far better than Calendar for many reasons, and performance is one of them.
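For readers who have not used JSR 310 yet, here is a minimal sketch of the java.time style. The types are immutable, so a formatter can be shared freely and each operation returns a new value instead of mutating a Calendar. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class JavaTimeExample {

    // Immutable and thread-safe, so a single shared instance is enough.
    static final DateTimeFormatter ISO = DateTimeFormatter.ISO_LOCAL_DATE;

    /** Number of whole days from one ISO date (yyyy-MM-dd) to another. */
    static long daysBetween(String from, String to) {
        LocalDate start = LocalDate.parse(from, ISO);
        LocalDate end = LocalDate.parse(to, ISO);
        return ChronoUnit.DAYS.between(start, end);
    }

    /** Adds weeks to an ISO date; plusWeeks returns a new LocalDate. */
    static String plusWeeks(String date, int weeks) {
        return LocalDate.parse(date, ISO).plusWeeks(weeks).format(ISO);
    }

    public static void main(String[] args) {
        System.out.println(daysBetween("2014-11-01", "2014-12-01")); // 30
        System.out.println(plusWeeks("2014-12-25", 1));              // 2015-01-01
    }
}
```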

[Ovidiu Mățan] Optimizing applications requires a proper and repeatable way of testing. Do you usually use a framework or some other mechanism to achieve this?

[Peter] Our clients use Chronicle Queue to record every input to a system for days at a time. Replaying this persisted queue allows them to recreate obscure bugs and investigate difficult-to-find performance issues. Trying to recreate performance problems with a synthetic test is a good start, but it's hard to find more than half the performance issues that way. Instead, it is better to use a day or a week of real workload.
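The record-and-replay idea can be sketched with plain JDK streams. This is an illustration of the technique only, not the Chronicle Queue API: every input is journalled to a file before it is processed, and the journal can later be fed through the same handler. All names here are hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Illustration only: plain JDK streams, NOT the Chronicle Queue API.
final class InputJournal {

    /** Writes each input to the journal file, then hands it to the live handler. */
    static void recordAndHandle(Path journal, List<String> inputs, Consumer<String> handler)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(journal))) {
            for (String input : inputs) {
                out.writeLong(System.nanoTime()); // arrival time, useful for timing analysis
                out.writeUTF(input);
                handler.accept(input);
            }
        }
    }

    /** Replays the journalled inputs in their original order; returns how many were replayed. */
    static int replay(Path journal, Consumer<String> handler) throws IOException {
        int count = 0;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(journal))) {
            while (true) {
                in.readLong();                // arrival time, ignored in this sketch
                handler.accept(in.readUTF());
                count++;
            }
        } catch (EOFException endOfJournal) {
            return count;                     // reached the end of the recording
        }
    }

    /** Records the inputs to a temporary file and returns what a replay produces. */
    static List<String> roundTrip(List<String> inputs) {
        try {
            Path journal = Files.createTempFile("inputs", ".journal");
            try {
                recordAndHandle(journal, inputs, input -> { /* live processing goes here */ });
                List<String> replayed = new ArrayList<>();
                replay(journal, replayed::add);
                return replayed;
            } finally {
                Files.deleteIfExists(journal);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("order:1", "cancel:1", "order:2")));
    }
}
```

Because the replay uses the same handler as production, a bug or slow path triggered by a real input sequence can be reproduced on demand.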

[Ovidiu Mățan] The workshop agenda includes the topic “Low level Java programming, how to make using Unsafe safer?”, which sounds really interesting. Can you give our readers a hint?

[Peter] In short, you want a library which encourages you to use Unsafe safely. We use a library we call Java-Lang, which has thread-safe, 64-bit versions of ByteBuffer that allow you to access shared memory-mapped files in a relatively safe manner. What should be more interesting is that you can use data structures we built on this library to support Queues, Maps and Sets, which have simpler interfaces to work with. Both Queue and Map can persist and share data between processes on the same machine at rates of 30+ million operations per second, something which would be impossible to do in pure Java any other way.

For example, you can write:

Map<String, String> map = ChronicleMapBuilder.of(String.class, String.class).createPersistedTo(file);

// you can now use the map as normal.

map.put("hello", "world");

String s = map.get("hello");

What is magic about this is that the entry is visible in less than a microsecond to all JVMs on your machine, and it is persisted. It makes for a simple and fast way to store data. Note: as the map is persisted to disk, it is limited only by the size of your free disk space, not your heap size. As it's off heap, it doesn't contribute to your GC pause time, no matter how big it is. It also takes little time to reload on restart, as it doesn't need to be loaded into the heap. A map with 1,000 million entries can take 10 ms to be ready to use.
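The underlying mechanism, a file mapped into memory outside the Java heap, can be tried with the JDK's own MappedByteBuffer. The sketch below is not the Java-Lang or Chronicle Map API, and the names are hypothetical. It also shows the gaps such a library has to fill: MappedByteBuffer is indexed by int, so a single mapping is limited to 2 GB, and the read-modify-write below is not thread safe.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration only: plain JDK memory mapping, NOT the Java-Lang or Chronicle Map API.
final class MappedCounter {

    /** Maps the first 8 bytes of the file and increments the long stored there. */
    static long increment(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapped region lives outside the Java heap and is backed by the file.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            long next = buffer.getLong(0) + 1; // not atomic: unsafe across threads or processes
            buffer.putLong(0, next);
            return next;
        }
    }

    /** Increments a counter in a temporary file twice, using a fresh mapping each time. */
    static long incrementTwice() {
        try {
            Path file = Files.createTempFile("counter", ".dat");
            file.toFile().deleteOnExit();
            increment(file);
            return increment(file); // the first update survived, as it lives in the file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(incrementTwice()); // 2
    }
}
```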

[Ovidiu Mățan] In your latest blog article, “Chronicle Map and Yahoo Cloud Service Benchmark”, you proposed using 100-byte values for the key-value pairs in order to get better results. Still, you mention the effect of the garbage collector in the tests. How can we minimise the intervention of garbage collection in high-performance code?

[Peter] In the Yahoo Cloud Service Benchmark, the garbage it produces eventually becomes a constraint. At about 3 million reads/writes per second, the benchmark can be using 90% of the CPU.

The best place to start reducing garbage is to run realistic tests in a memory profiler. I use YourKit, but there are other good commercial profilers. Once you can see where most of the garbage is being created, you can replace those hot spots with alternatives which create little or no garbage, e.g. use primitives, use plain objects instead of Maps, recycle mutable objects, or use off-heap data structures.
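Two of the alternatives listed above, primitives and recycled mutable objects, can be sketched as follows. The class is hypothetical and deliberately small; which replacement pays off in a real system depends on what the profiler shows.

```java
import java.util.ArrayList;
import java.util.List;

final class LowGarbage {

    // Boxed: most values become a separate Long object on the heap.
    static long sumBoxed(int n) {
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            values.add(i * 1000L);
        }
        long sum = 0;
        for (Long value : values) {
            sum += value;
        }
        return sum;
    }

    // Primitive: a single long[] allocation, no per-value objects.
    static long sumPrimitive(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i * 1000L;
        }
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Recycled mutable object: one StringBuilder is reset and reused for every message.
    private final StringBuilder line = new StringBuilder();

    /** The returned CharSequence is only valid until the next call (single-threaded use). */
    CharSequence format(long id, long priceInCents) {
        line.setLength(0);
        line.append(id).append('=').append(priceInCents);
        return line;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1000) == sumPrimitive(1000)); // true
        System.out.println(new LowGarbage().format(7, 150));       // 7=150
    }
}
```

The recycled buffer trades safety for speed: callers must not keep the returned CharSequence, which is the kind of constraint that makes this worthwhile only on hot paths.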

The biggest benefit of reducing garbage is not just the GC pause times, but how fast your code runs between pauses. Reducing the allocation rate can improve throughput by 2-5x, excluding GC pause times. Lower allocation rates mean you are not filling your CPU caches with garbage every fraction of a millisecond, and your threads work more effectively. The L1 cache is not only 10-20x faster than your L3 cache, but the L3 cache is a shared resource, so your threads start getting in each other's way the more you use the L3 cache, and your program doesn't scale as well when using more cores.
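One way to put a number on the allocation rate of a piece of code, assuming a HotSpot-based JVM, is the per-thread allocation counter exposed by com.sun.management.ThreadMXBean. The sketch below is a rough measuring aid rather than a profiler replacement; the class name and the demo workload are hypothetical, and it returns -1 where the counter is unavailable.

```java
import java.lang.management.ManagementFactory;

final class AllocationMeter {

    // Keeps the demo allocations reachable so the JIT cannot optimise them away.
    static volatile Object sink;

    /** Bytes allocated by the current thread while running the task, or -1 if unsupported. */
    static long allocatedBytes(Runnable task) {
        java.lang.management.ThreadMXBean standard = ManagementFactory.getThreadMXBean();
        if (!(standard instanceof com.sun.management.ThreadMXBean)) {
            return -1; // not a HotSpot-style JVM
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) standard;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        task.run();
        return bean.getThreadAllocatedBytes(threadId) - before;
    }

    /** Demo workload: allocates the given number of 1 KB blocks. */
    static void allocateKilobytes(int count) {
        byte[][] blocks = new byte[count][];
        for (int i = 0; i < count; i++) {
            blocks[i] = new byte[1024];
        }
        sink = blocks;
    }

    static long measureDemo() {
        return allocatedBytes(() -> allocateKilobytes(1000));
    }

    public static void main(String[] args) {
        System.out.println("allocated bytes: " + measureDemo());
    }
}
```

Dividing the result by the elapsed time gives an allocation rate that can be compared before and after a change.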