Have you ever found yourself scrutinizing old code you wrote years before, realizing in hindsight how slow it actually was, how badly you wrote it at the time, and pondering how it still managed to work acceptably for such a long time? We have, on more than one occasion… After a while, you can't help but wonder what exactly saves you so often from your own stupidity. Could it be luck? Unlikely… And after many years we have finally figured it out. We know now who our guardian angel is. It's the JVM, this smart, amazing machine, about which we Java programmers know so very little.
We all know that the JVM, the Java Virtual Machine, is the platform running our applications, but not all of us are aware that it's not just a blind and passive executor. It applies a large panoply of optimizations to our code before actually running it. Most Java programmers aren't aware that the code they write is never exactly the same as the code actually being run on the JVM. How many of us routinely think about JIT compilation, method inlining, dead code elimination, escape analysis, loop unrolling, branch prediction, null check elimination and other such wizardry happening behind the scenes in every JVM on every run of our applications?
On the other hand, we have to admit that it's not necessarily a bad thing that most of us don't have to constantly think about these things. After all, a big advantage of the Java environment is precisely that many more programmers can use it much more productively. But once you become aware of these unseen machinations, ignoring them becomes pretty difficult. So let's talk about them a bit and give thanks to them for making our lives a bit easier day by day.
Let us start with a bit of history. Early JVM versions were pretty simple: they read the bytecode produced by the Java compiler (javac) and interpreted it directly. This way, the machine code being run was the interpreter's code, while the client code was just a kind of input for it.
To improve performance, Just-In-Time (JIT) compilation was introduced in JDK 1.3, giving the JVM the capability to compile client code (that is, code written by the application programmer) directly into machine code and execute that instead of interpreting it. However, that is not the full extent of JIT's capabilities. Being a real compiler gives it access to all the tricks of a static compiler, like a C++ compiler for example, so it can perform all the optimizations possible when compilation is done in advance of running the code. Yet being a dynamic compiler (meaning it does its work during code execution) also gives it access to optimization techniques which aren't available to its static cousins.
Most importantly, the dynamic nature of the compilation makes it possible to generate machine code specific to (and optimized for) the actual architecture/hardware the JVM is running on. The same Java program can thus be optimized differently depending on the platform it's running on, making use of the specific features and strengths of that platform.
The JVM dedicates time to gathering runtime statistics on the code it's executing (it does so on a background thread, so execution is not affected), for example watching which parts run most frequently, and uses this information to decide what is worth optimizing and what is not.
It also monitors how certain code is executed. For example, it can notice that certain code branches are never actually taken and recompile the code so that those particular branch instructions are completely omitted. This technique is commonly referred to as "branch prediction". Of course, there is a safety check (called an "uncommon trap") which fires if the situation changes (the assumption gets proven wrong), and then the code is immediately corrected and recompiled without affecting the correctness of the end result. While new compiled versions are being prepared (also on a background thread), execution falls back to interpreted mode (or to a less aggressively optimized compiled version). This way, regardless of how aggressive the optimizations turn out to be, they never compromise the correctness of a running program.
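A hypothetical sketch of the idea (the names and numbers are ours, not HotSpot internals): after process() has been called many times with debug always false, the JIT may compile it with the logging branch dropped entirely, guarded by an uncommon trap that deoptimizes back to the interpreter if the flag ever flips.

```java
public class BranchDemo {
    static volatile boolean debug = false;

    static long process(long value) {
        if (debug) {                          // observed as never taken
            System.out.println("value = " + value);
        }
        return value * 31;                    // the hot path that remains
    }

    public static void main(String[] args) {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) { // enough calls to trigger the JIT
            acc += process(i);
        }
        System.out.println(acc);
    }
}
```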
Branch elimination can yield much bigger performance improvements than one would normally suspect. This is because branches can seriously hinder the instruction pre-fetchers of modern CPUs. If a branch goes in a direction other than the one predicted by the pre-fetcher, the whole execution pipeline has to be flushed and refilled, and fetching the right instructions from memory (or even from the slower caches) can leave the processor with nothing to execute for tens, even hundreds of cycles. No useful progress is made for a long time (in CPU terms).
Precisely because the JVM carefully observes the execution of code, not everything gets optimized. It wouldn't be practical; more resources would be lost on the analysis than would be gained by the optimization. Considering that most applications spend 80% of their running time in 20% of the code, the so-called "hot" portion (see the Pareto principle), heavily optimizing that small portion of code yields most of the improvement that can be achieved anyway.
Let's consider some of the various optimization techniques. Method inlining is one of them. The idea is to get rid of the cost of method calls by inserting the code from the method's body directly into the method's original call site (and removing the original call). Of course, it wouldn't make sense to inline every single method. The JVM normally does this only to tiny methods (whose bytecode size is smaller than 35 bytes) and to hot (frequently called) methods with a bytecode size below 325 bytes. A method is considered hot if it has been called at least 10,000 times; all these numeric limits are tunable via JVM flags (-XX:MaxInlineSize, -XX:FreqInlineSize and -XX:CompileThreshold, respectively).
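A minimal sketch of an inlining candidate (the classes are our own toy example): an accessor this small is far below the default -XX:MaxInlineSize limit, so a hot loop calling it gets compiled as if the field were read directly, with no call overhead.

```java
public class InlineDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int x() { return x; }                 // trivially inlinable
    }

    static long sumX(Point[] points) {
        long sum = 0;
        for (Point p : points) {
            sum += p.x();                     // effectively becomes p.x
        }
        return sum;
    }

    public static void main(String[] args) {
        Point[] points = { new Point(1, 2), new Point(3, 4) };
        System.out.println(sumX(points));     // prints 4
    }
}
```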
An interesting aspect of method inlining is that it's also influenced by polymorphism. When a method is called, the JVM checks how many actual implementations of that method exist in the currently loaded class hierarchy. If there is one, the call site is monomorphic; if there are two, it's bimorphic; and if there are more, it's megamorphic. Mono- and bimorphic calls can be efficiently inlined, but megamorphic ones can't. What's most interesting here is that what matters is the number of implementations actually loaded and used, not the number of all possible implementations in the codebase. It's OK to have hundreds of implementations in our codebase; if only one or two of them are loaded by the classloader in a given application, our code in that application remains seriously optimizable. The JVM will notice that it can use cheap mono- or bimorphic calls and will do so.
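A sketch of call-site morphism, using example types of our own invention:

```java
public class MorphismDemo {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    static final class Square implements Shape {
        final double s;
        Square(double s) { this.s = s; }
        public double area() { return s * s; }
    }

    // If only Circle ever reaches this loop at runtime, the call site is
    // monomorphic and area() can be inlined behind a single cheap type check;
    // with Circle and Square it's bimorphic and still inlinable; a third
    // loaded-and-used implementation would make it megamorphic and force a
    // genuine virtual dispatch.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1), new Square(2) };
        System.out.println(totalArea(shapes));
    }
}
```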
Another optimization technique, one that's very similar in spirit to method inlining, is "loop unrolling". Just like in the case of inlining, a loop is replaced, completely or partially, with a fixed number of consecutive copies of the original loop body, removing the per-iteration increment, test and branch.
"Loop unrolling" example when the loop overhead is relatively big
Null check elimination. When writing code, it's often beneficial to check whether references are null (after all, by some estimates more than 70% of all exceptions happening in production are NullPointerExceptions). In optimized machine code, however, many of these null checks can be eliminated without compromising the safety of the code, which the JVM can turn into significant performance gains. Moreover, in the very rare case when it gets it wrong, when its assumptions prove incorrect, the "uncommon trap" steps in and saves the day.
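A hedged sketch (the method is our own example): if profiling shows the argument is never null here, the JIT can drop both the explicit check and the implicit one inside length(), recovering through an uncommon trap in the rare case a null does eventually arrive.

```java
public class NullCheckDemo {
    static int safeLength(String s) {
        if (s == null) {          // branch never taken in this app's profile
            return 0;
        }
        return s.length();        // its implicit null check is also redundant
    }

    public static void main(String[] args) {
        System.out.println(safeLength("hello"));  // prints 5
    }
}
```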
Yet another technique employed by the JVM is "escape analysis", which attempts to determine whether a certain object can escape the scope of the current method or thread. If, for example, an object does not "escape" the scope of a method, then it's not necessary to allocate it on the heap, so we don't have to pay the performance penalty of the garbage collector cleaning it up ("allocation elimination"). The aforementioned inlining can improve the effectiveness of this technique a lot. On the other hand, if an object is used on a single thread only, then all locks on that object can safely be removed ("lock elision") and we don't have to pay the very heavy cost of synchronization.
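A sketch of both effects, under the assumption that escape analysis succeeds (the types and names are ours):

```java
public class EscapeDemo {
    static final class Vec {
        final double dx, dy;
        Vec(double dx, double dy) { this.dx = dx; this.dy = dy; }
    }

    // 'v' never leaves this method, so the JIT can skip the heap allocation
    // entirely and keep dx/dy in registers ("scalar replacement").
    static double distance(double x1, double y1, double x2, double y2) {
        Vec v = new Vec(x2 - x1, y2 - y1);
        return Math.sqrt(v.dx * v.dx + v.dy * v.dy);
    }

    // The StringBuffer is confined to one thread and never escapes, so the
    // locks taken by its synchronized append() calls can be elided.
    static String join(String a, String b) {
        StringBuffer sb = new StringBuffer();
        return sb.append(a).append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(distance(0, 0, 3, 4)); // prints 5.0
        System.out.println(join("foo", "bar"));   // prints foobar
    }
}
```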
Lock coarsening. When entirely eliminating a lock is not an option, the JVM can sometimes merge multiple synchronized blocks of code executed close to each other (as long as they use the same lock). An example of this is a synchronized method called from inside a loop. Considering how big the performance cost of locking can be (a contended lock acquisition may end up mediated by the OS) and how unbounded the ignorance of Java programmers towards such facts can be, this technique can also result in huge gains.
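A sketch of a coarsening candidate; the little logger is our own toy example, and whether coarsening actually fires is the JIT's call:

```java
public class CoarsenDemo {
    private final StringBuffer log = new StringBuffer();

    void recordAll(String[] lines) {
        for (String line : lines) {
            log.append(line).append('\n');  // each append() locks 'log'...
        }
        // ...so the JIT may coarsen the repeated lock/unlock pairs into a
        // single acquisition held across several iterations.
    }

    public static void main(String[] args) {
        CoarsenDemo demo = new CoarsenDemo();
        demo.recordAll(new String[] { "a", "b", "c" });
        System.out.print(demo.log);
    }
}
```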
Dead code elimination. In real codebases there are many chunks of strange code, usually left behind by repeated refactorings, negligence and other causes, which don't actually influence the results of the computation. An example is a counter which we increment in a loop, at each iteration, but which we never actually use for anything, neither inside nor outside the loop (sketched below). Such code can often be detected by the JVM and then completely eliminated, which can also significantly improve performance. We could even argue that not running a piece of code at all is the ultimate performance improvement that can be applied to that particular code… There are, however, situations when this optimization technique can create problems of its own; see the next example on naive microbenchmark code.
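The counter described above, sketched out (names are ours): 'touched' is written but never read, so the JIT is free to remove it along with its increments.

```java
public class DeadCodeDemo {
    static long sumPositive(long[] values) {
        long sum = 0;
        int touched = 0;          // dead: no one ever reads this
        for (long v : values) {
            touched++;            // eliminated together with the variable
            if (v > 0) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumPositive(new long[] { -1, 2, 3 })); // prints 5
    }
}
```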
Naive microbenchmark/test code destroyed by the JVM during optimization.
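A hypothetical sketch of such a benchmark (the numbers are ours; a real harness such as JMH exists precisely to avoid these pitfalls): the loop produces no observable result, so the JIT is entitled to eliminate it entirely, and the measured time then says nothing about the cost of Math.sqrt().

```java
public class NaiveBenchmark {
    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 100_000_000; i++) {
            Math.sqrt(i);                     // result discarded -> dead code
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("took " + elapsed + " ns"); // suspiciously small
    }
}
```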
All the techniques we have talked about are just the tip of the iceberg (we haven't even mentioned "code motion", "expression hoisting & sinking", "redundant store elimination", "loop unswitching" and god knows what else we haven't even heard about). We just wanted to give you a taste of what's going on in the guts of this amazing machine.
The JVM is not only sophisticated, it's also evolving continuously. Its panoply of optimizations grows more complex with every new release. Java 9, for example, gained support for writing dynamic compilers in Java itself, which can potentially replace the ones the JVM uses by default (see the Java Virtual Machine Compiler Interface - JVMCI, JEP 243). The appearance of higher-quality compilers, or compilers specialized for certain business domains, thus becomes possible. Another new option is Ahead-of-Time compilation (JEP 295) for the kinds of applications which can't afford to wait while the JVM gathers its pre-optimization statistics, but need peak (or at least somewhat optimized) performance from the first moments of runtime.
So, next time you are bragging to a colleague about the elegance of the code you've just written, stop and think for a second about how much unseen help you are in fact getting, and whether you could afford all that elegance without the amazing JVM.