Actually, that should be Rule #Zero!
According to Knuth's Turing Award lecture, Computer Programming as an Art (1974), "premature optimization is the root of all evil (or at least most of it) in programming." Has the truth behind this statement changed over the last forty years? Let's look at some examples in C++, a language that even today is a synonym for performance. Let's start with a simple function: division by four.
Below you can see the standard division code, followed by its output after the compiler optimized the build:
unsigned int div_by_4(unsigned int num)
{
    return num / 4;
}

div_by_4(unsigned int):
    mov eax, edi
    shr eax, 2
    ret
I use Compiler Explorer to check the compiled result in assembly. When I searched the internet for best-practice tips on speed optimization in C++, I found this one: "use bitwise shifting for division or multiplication by powers of two."
unsigned int div_by_4(unsigned int num)
{
    return num >> 2;
}

div_by_4(unsigned int):
    mov eax, edi
    shr eax, 2
    ret
Above you can see the bitwise shifting code, again followed by its output after the compiler optimized the build. If you compare the outputs of both versions, you will see that they are exactly the same: shift the value in the EAX register right by 2. So, what did we achieve by using a bitwise shift instead of standard division? We made the code harder to read. Even if it takes just half a second, we have to ask ourselves: what does shifting right by 2 mean? A well-versed C++ developer could say that they read the second version as easily as the first one. But what about the following sample? Let's remove the unsigned keyword. Will it change the output, or will it be exactly the same?
int div_by_4(int num)
{
    return num / 4;
}

div_by_4(int):
    lea eax, [rdi+3]
    test edi, edi
    cmovns eax, edi
    sar eax, 2
    ret
I have bad news for those who bet on "no change". Instead of a single shift-right instruction, the compiler now adds 3 to the value, tests the sign bit, and conditionally keeps the original value, before the familiar shift right by two (this time an arithmetic shift, sar). So, what happened? The function can now also be used to divide negative numbers, and that changes the algorithm: integer division truncates toward zero, while an arithmetic shift rounds toward negative infinity, so for negative inputs num / 4 and num >> 2 produce different results. Had we "optimized" this version by hand with a shift, we would have introduced an unobvious bug.
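A minimal sketch of the discrepancy (my example, not from the listings above; note that before C++20 the result of right-shifting a negative value was even implementation-defined):

#include <iostream>

int main()
{
    int num = -7;
    std::cout << (num / 4) << '\n';   // prints -1: division truncates toward zero
    std::cout << (num >> 2) << '\n';  // typically prints -2: arithmetic shift rounds down
    return 0;
}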
Let's check another example. What do you think about calculating 10! (ten factorial)?
int main()
{
    int result = 1;
    for (int i = 1; i <= 10; ++i)
    {
        result *= i;
    }
    return result;
}

main:
    mov eax, 3628800
    ret
There is a single move instruction: the compiler calculated the result for us and replaced our for loop with a constant. Now this is an optimization!
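If you want to guarantee compile-time evaluation instead of relying on the optimizer, C++11's constexpr makes the intent explicit (a sketch of mine, not from the listings above):

constexpr int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

int main()
{
    // The constexpr variable forces evaluation at compile time;
    // the binary again contains only the constant 3628800.
    constexpr int result = factorial(10);
    return result;
}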
What about inlining? In the old C days, you could use macro functions, but I would strongly advise against them, unless you really know what you are doing and how the resulting code will behave; a classic pitfall is sketched below. C++ has the inline keyword, with which we can ask the compiler to inline a function.
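Here is that classic macro pitfall (the macro name is mine, for illustration): because a macro expands its arguments textually, an argument with a side effect can be evaluated twice:

#include <cstdio>

#define MAX_MACRO(a, b) ((a) < (b) ? (b) : (a))

int main()
{
    int i = 5;
    int m = MAX_MACRO(i++, 5);         // expands to ((i++) < (5) ? (5) : (i++))
    printf("m = %d, i = %d\n", m, i);  // m = 6, i = 7: i was incremented twice
    return 0;
}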
I created a simple template function that returns the bigger of its two input parameters. If you check the assembly output of both the inlined and the non-inlined versions, you will see that there is no difference.
#include <iostream>

template <typename T>
T const& MAX(T const& a, T const& b)
{
    return (a < b) ? b : a;
}

int main()
{
    int a = 0;
    std::cin >> a;
    return MAX(a, 42);
}
main: # @main
    push rax
    mov dword ptr [rsp + 4], 0
    lea rsi, [rsp + 4]
    mov edi, std::cin
    call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
    mov ecx, dword ptr [rsp + 4]
    cmp ecx, 41
    mov eax, 42
    cmovg eax, ecx
    pop rcx
    ret
Clang noticed on its own that the function was worth inlining. Adding the inline keyword explicitly changes nothing:
#include <iostream>

template <typename T>
inline T const& MAX(T const& a, T const& b)
{
    return (a < b) ? b : a;
}

int main()
{
    int a = 0;
    std::cin >> a;
    return MAX(a, 42);
}
main: # @main
    push rax
    mov dword ptr [rsp + 4], 0
    lea rsi, [rsp + 4]
    mov edi, std::cin
    call std::basic_istream<char, std::char_traits<char> >::operator>>(int&)
    mov ecx, dword ptr [rsp + 4]
    cmp ecx, 41
    mov eax, 42
    cmovg eax, ecx
    pop rcx
    ret
I would say that you can do more harm by forcing the compiler to inline certain functions than by letting it decide which functions are worth inlining.
In fact, today both GNU and Google pour a lot of resources into the optimizers of their compilers. Please use the work of those well-paid developers with PhDs whenever possible, and don't waste your precious time trying to do something that a computer can do better. In most cases you can simply forget about things like dead code elimination, loop unrolling, even inlining. And if you really want to restructure code manually, please use an IDE, for example IntelliJ IDEA or CLion by JetBrains, which provide pretty good code analysis and refactoring features.
Here we go. If poorly performing code is fed to the compiler, it will not come out optimized - there is no such thing as a free lunch. Even if a compiler is pretty smart, it is not a magic wand. Yet we, The Developers, usually have a preconception that a certain part of the code is performing poorly, so we start to optimize that part. It might take half a week of hard work, and the results may be disappointing. Let's step back: how sure are we that it was exactly that part of the code that consumed the most time?
Did we check? I once developed an application that parsed a text file, did some work on the data, and placed the results into a database via Hibernate. The application was slow and kept throwing an out-of-memory exception. The processing didn't seem to have any visible issues, so the ideal candidate for optimization was the file parsing. After a few days spent on it, I was able to parse the file about 50% faster according to my test cases. And what a surprise: when I integrated the solution into the original application, I could hardly see any improvement. I took one step back and did what I was supposed to do from the beginning: I profiled the whole application. The results were surprising: less than 5% of the time was spent on file parsing, less than 10% on the processing, and the rest on SQL calls. The application was issuing a distinct insert call for each line processed; batching multiple inserts into one call solved the issue.
Lesson learned: start with a high-level profiling of your application. Find the actual hotspots. For C++, I would recommend the perf toolset: it is free, powerful, and you can run it in a console. Intel's VTune has improved a lot in the last years; it has a nice (sometimes overwhelming) graphical interface, but you have to pay money for it. For Java applications, I prefer the NetBeans profiler or, if you want to use the big guns, JProfiler. Once the real problem areas are identified, we can go into micro-benchmarking, with tests that run on both the existing code and the improved one, so we can measure the improvement. Thus, the second rule is: don't trust anyone, measure it. And measure the right thing - measuring the wrong thing will lead to a wrong solution.
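As a starting point, here is a minimal hand-rolled micro-benchmark sketch (mine, for illustration; for serious measurements, prefer a framework such as Google Benchmark, and beware of the optimizer removing the measured work):

#include <chrono>
#include <iostream>

template <typename Func>
long long average_ns(Func f, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        f();  // run both the old and the new implementation through this
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / iterations;
}

int main()
{
    volatile unsigned int sink = 0;  // volatile keeps the work from being optimized away
    std::cout << "div:   " << average_ns([&] { sink = sink / 4 + 1; }, 1000000) << " ns\n";
    std::cout << "shift: " << average_ns([&] { sink = (sink >> 2) + 1; }, 1000000) << " ns\n";
    return 0;
}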
When talking about performance optimization, most people think of abstract things: a special OS, some black-magic line of code that only a few select elders can understand. Yes, you can go to that level too, but usually you only fight for those last nanoseconds once every low-hanging fruit has been harvested.
We ran a small contest asking developers to handle a really easy problem. It had an O(n) solution and an O(1) one. To my surprise, less than 0.5% of the participants implemented the constant-complexity solution. So please take your time, grab paper and pen if needed, and make sure that there is no better algorithm for your problem. It is one less reason for optimization later.
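The contest problem itself is not reproduced here, but a classic illustration of O(n) versus O(1) is summing the first n integers (a sketch of mine):

#include <iostream>

// O(n): add the numbers one by one.
long long sum_linear(long long n)
{
    long long sum = 0;
    for (long long i = 1; i <= n; ++i)
    {
        sum += i;
    }
    return sum;
}

// O(1): Gauss' closed-form formula n * (n + 1) / 2.
long long sum_constant(long long n)
{
    return n * (n + 1) / 2;
}

int main()
{
    std::cout << sum_linear(1000000) << '\n';    // 500000500000
    std::cout << sum_constant(1000000) << '\n';  // the same result, in constant time
    return 0;
}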
Memory is cheap. Indeed, with today's hardware it is really cheap to add some more physical memory modules. However, managing a huge amount of memory can reduce performance. In C++, on a modern architecture, calling new to allocate memory costs between 160 and 300 CPU cycles. By comparison, a multiplication takes 1-2 cycles and a division around 30-35 cycles. That does not sound like much, but say your code performs ten multiplications and one allocation/deallocation: ten multiplications at 2 cycles each give about 20 cycles of useful work, and counting just the new at 160 cycles brings the total to 180. That means only 20/180, roughly 11%, of the load is spent on actual work, while the other 89% could be optimized away. Java and .NET developers might jump in now, saying that their runtimes manage memory for them with ultra-performant garbage collectors.
Taking one step towards the principle "the best code is no code at all", I would say that code that is not executed is the fastest code. Just think about that the next time the garbage collector pauses your running code in the JVM to clean up the memory. If possible, avoid dynamic memory allocation: I recommend allocating memory at startup and reusing it later.
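A sketch of that allocate-once, reuse-later pattern (the names and sizes are mine, for illustration):

#include <cstddef>
#include <vector>

std::vector<double> g_scratch;  // allocated once, reused on every call

void init()
{
    g_scratch.reserve(1024 * 1024);  // one allocation, up front at startup
}

void process_frame(const double* data, std::size_t count)
{
    g_scratch.assign(data, data + count);  // reuses the existing capacity
    // ... work on g_scratch ...
}

int main()
{
    init();
    double samples[] = {1.0, 2.0, 3.0};
    process_frame(samples, 3);
    return 0;
}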
Immutable objects. In Java, we can easily forget that Strings are immutable. How many times did you write something similar to this:

public String FooFunction(String input1,
                          String input2, String input3)
{
    return "Input1 = " + input1 + "Input2 = "
           + input2 + "Input3 = " + input3;
}
And how many times did you think about the fact that it translates to something similar to this:
public String FooFunction(String input1,
                          String input2, String input3)
{
    String anon1 = "Input1 = ";
    String anon2 = anon1 + input1;
    String anon3 = "Input2 = ";
    String anon4 = anon2 + anon3;
    String anon5 = anon4 + input2;
    String anon6 = "Input3 = ";
    String anon7 = anon5 + anon6;
    String anon8 = anon7 + input3;
    return anon8;
}
It will generate a few "garbage" string objects, right? This would be acceptable if FooFunction were called just a few times. But what happens if you use a similar function to build up an SQL query with, let's say, 20 variables, and you call it a couple of million times? The garbage piles up quickly. At a certain point, it might be a good idea to use a reusable string builder, clearing it each time you want to concatenate another set of strings, as sketched below.
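Staying with C++ for the sketch (in Java the equivalent would be a reused StringBuilder), the same idea looks like this; the buffer and function names are mine:

#include <iostream>
#include <string>

std::string g_buffer;  // allocated once, reused across calls

const std::string& buildMessage(const std::string& input1,
                                const std::string& input2,
                                const std::string& input3)
{
    g_buffer.clear();  // in practice, clear() keeps the already-allocated capacity
    g_buffer.append("Input1 = ").append(input1)
            .append("Input2 = ").append(input2)
            .append("Input3 = ").append(input3);
    return g_buffer;
}

int main()
{
    std::cout << buildMessage("a", "b", "c") << '\n';
    return 0;
}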
Usage of Vectors. What is a vector? According to www.cppreference.com, it is a sequence container that encapsulates dynamic size arrays. Great, it means that we don't have to bother about the size of the array: if we need more space, it will magically appear. And we pay the price for the magic in CPU cycles.
Why? Vectors are fast to iterate over precisely because their elements are stored contiguously. Let's look at the following example. You place ten elements in a vector, and you want to place one more, but there is no space allocated for it. What happens? The container allocates a new contiguous memory block bigger than the original one, usually bigger than you actually need, copies the elements from the old block to the new one, and deallocates the old block. What happens when you do this several million times? The memory gets fragmented: it takes more and more time to find a contiguous space that is big enough, and more and more time to copy from the old location to the new one. And if you are lucky, the garbage collector (if you have one) has already done its job, and you won't run into some ugly "out of memory" message. When you know the expected size in advance, you can avoid the whole grow-copy-free cycle by reserving the capacity up front, as in the sketch below.
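A minimal sketch (mine): std::vector::reserve performs a single allocation up front, so the subsequent push_back calls never reallocate or copy:

#include <vector>

int main()
{
    std::vector<int> values;
    values.reserve(1000000);  // one allocation, sized for everything we will add

    for (int i = 0; i < 1000000; ++i)
    {
        values.push_back(i);  // never triggers a reallocation or an element copy
    }
    return 0;
}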
We analyzed just some generic examples, but I hope that I convinced you to refresh your memory on what happens "under the hood", and to use your tools as they were intended to be used.