Vectorization and out-of-order execution

Optimizing vector operations is an important topic in computer science. Many mathematical algorithms, when implemented on a computer, are essentially operations over arrays, and most of the time very big ones. Today's work is inspired by a paper that tackles the problem of optimally implementing micro-kernels. Micro-kernels are a very specific kind of function that usually operates over large arrays or performs a lot of computation over a bounded array. Whatever the case, one thing is clear: it is important to identify possible optimizations and benchmark them to find out in practice how good they really are.

For this exploration we will use the C language to look at optimizations we can apply to array operations. If you are asking why I chose an old language like C instead of some very cool high-level language, the answer is that high-level languages don't give you enough control over the processing flow and data structures for optimization purposes. If you take the time to read the article I mentioned you'll see that in a language like Java you can use some techniques to optimize your algorithms, but many of them rely on the JIT compiler and, again, you don't have direct control over the optimizations. In case you are curious about the optimizations done by the JIT, I'll show an example of one called method in-lining, which is also described in that paper. So, back to the point. In short, what we want to optimize is the loopA0 C function.

#define SIZE_DATA 256

double ad[SIZE_DATA] __attribute__((aligned(32))); // aligned so both the 128-bit and 256-bit aligned loads used later are valid

void initializeDoubles(){
  int i;
  for(i=0;i<SIZE_DATA;i++){
    ad[i]=i;
  }
}

void loopA0(int n) {
   int i;
   for (i=0; i<n; i++){
      ad[i%255] = ad[i%255]+ad[i%255];
   }
}

So, in a nutshell, we have a double array with 256 entries and we use a compiler attribute to align the data to 32 bytes so the CPU (and the aligned SSE/AVX loads we will use later) can access it efficiently. This is, technically, the first optimization: guaranteeing that the data buffer is memory aligned. After that we create a utility function to populate the buffer and, finally, our loopA0 function. As you can see, it performs a very simple operation over the buffer, computing 2 times x for each element it touches.
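If you want to convince yourself that the attribute actually did something, a quick way is to print the buffer address modulo the requested alignment. This is just a throwaway check you can drop into the same file (the checkAlignment name is mine, it is not part of the benchmark):

#include <stdio.h>
#include <stdint.h>

void checkAlignment(void){
  // a properly aligned address leaves no remainder when divided by 32
  printf("ad starts at %p, address %% 32 = %lu\n",
         (void*)ad, (unsigned long)((uintptr_t)ad % 32));
}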

So our main function will look like this

#include <stdio.h>
#include <sys/time.h>

#define ITERATIONS  1000000000

int diffTime(struct timeval start,struct timeval end);

int main(void){
  // Structures to capture the time before and after the call
  struct timeval start;
  struct timeval end;

  initializeDoubles();

  printf("Processing %d iterations\n",ITERATIONS);

  gettimeofday(&start,NULL);
  loopA0(ITERATIONS);
  gettimeofday(&end,NULL);

  printf("Elapsed time LoopA0 %d milliseconds!\n",diffTime(start,end));

  return 0;
}

So what did we do? First we define the number of iterations as 1 billion. Then we create the structures needed to capture the time before and after the function invocation, and finally we print the number of milliseconds elapsed. Pretty straightforward, right? In case you are wondering what diffTime is, it's just a utility we define to compute time intervals, and it looks like this

int diffTime(struct timeval start,struct timeval end){
  // convert both timestamps to milliseconds and return the difference
  long msStart = start.tv_sec * 1000 + start.tv_usec / 1000;
  long msEnd = end.tv_sec * 1000 + end.tv_usec / 1000;

  return msEnd-msStart;
}
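To reproduce the numbers below you need to build everything into one program. The original setup does not give the file name or compiler flags, so take this as one possible invocation: the intrinsics used later need AVX support, which GCC enables with -mavx, and high optimization levels may remove the loops entirely since their results are never stored, so a low optimization level keeps the benchmark honest.

gcc -O0 -mavx -o vectorization vectorization.c
./vectorization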

So the first question is: how can we optimize this function? It seems pretty difficult to make this code run faster than it already does. Algorithmically speaking you are right, we cannot find a more efficient way of adding two values than just adding them. The point, however, is that on a real CPU we can add two values with different instructions. Code like the previous, when disassembled, generates assembly like the following

00000000004005f4 <loopA0>:
  4005f4:    55                      push   %rbp
  .
  .
  .
  40065b:    c5 fb 10 04 c5 60 30    vmovsd 0x603060(,%rax,8),%xmm0
  400662:    60 00
  400664:    c5 fb 58 c0             vaddsd %xmm0,%xmm0,%xmm0
  400668:    48 63 c1                movslq %ecx,%rax
  40066b:    c5 fb 11 04 c5 60 30    vmovsd %xmm0,0x603060(,%rax,8)
  400672:    60 00
  400674:    83 45 fc 01             addl   $0x1,-0x4(%rbp)
  400678:    8b 45 fc                mov    -0x4(%rbp),%eax
  40067b:    3b 45 ec                cmp    -0x14(%rbp),%eax
  40067e:    7c 84                   jl     400604 <loopA0+0x10>
  400680:    5d                      pop    %rbp
  400681:    c3                      retq   

The disassembly shows that the addition of 64-bit floating point values is done with the vaddsd instruction, the VEX-encoded (AVX) form of the scalar SSE2 addsd. So the compiler already uses the SIMD registers for this naive C loop, just not in packed form. What vaddsd does is

DEST[63:0] ← SRC1[63:0] + SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[VLMAX-1:128] ← 0

So what is going on here? This instruction operates on 128-bit registers called %xmm0, %xmm1, ... and performs the following operation

X = 128 bits = (X1,X2), where X1 and X2 are 64 bits each
Y = 128 bits = (Y1,Y2), where Y1 and Y2 are 64 bits each

Z = vaddsd(X,Y) = (X1+Y1, X2)

So this operation only adds the two low 64-bit float values; the upper 64 bits of the destination are simply copied from the first source and no useful work is done with them. What we do in the second implementation is exactly that: replace this scalar instruction with the packed variant, which uses both halves of the register.

void loopA2(int n){
  __m128d sum = _mm_setzero_pd(); // accumulator must be initialized
  __m128d v;

  int i;
  for(i=0;i<n;i+=2){
    v = _mm_load_pd(&(ad[i%256]));  // load two doubles at once
    sum = _mm_add_pd(sum,v);        // add both 64-bit lanes in one instruction
  }
}

What kind of dark magic is this, right? Well, it turns out that __m128d is just a C container provided by the Intel intrinsics headers (available through immintrin.h) so that we can do exactly what we described earlier. _mm_load_pd and _mm_add_pd are the C wrapper functions that map to the vmovapd and vaddpd instructions.
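If you have never touched intrinsics before, a tiny standalone example may help. This is purely illustrative and separate from the benchmark: it loads two doubles, adds them lane by lane, stores the result back and prints it.

#include <stdio.h>
#include <immintrin.h>

int main(void){
  double in[2]  __attribute__((aligned(16))) = {1.0, 2.0};
  double out[2] __attribute__((aligned(16)));

  __m128d v = _mm_load_pd(in);        // load both doubles into one 128-bit register
  __m128d r = _mm_add_pd(v, v);       // packed add: both lanes are doubled
  _mm_store_pd(out, r);               // write the two results back to memory

  printf("%f %f\n", out[0], out[1]);  // prints 2.000000 4.000000
  return 0;
}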

Now, if we run both functions we get the following benchmark

Processing 1000000000 iterations
Elapsed time LoopA0 3789 milliseconds!
Elapsed time LoopA2 2360 milliseconds!

As you can see we got a substantial increase in performance. This optimization is what we call vectorization, and it is called that because we grouped the data under a vector structure and applied the same operation to every element of the vector at once. But is that all? Not really. We can go further and use an interesting trick, which is to parallelize. How do we do that? The CPU is clever enough to identify independent instructions and execute them in parallel, a technique we call out-of-order execution. So we can issue several independent add operations into separate accumulators, like this:

void loopA3(int n){
  // four independent accumulators so the CPU can execute the adds in parallel
  __m128d sum0 = _mm_setzero_pd();
  __m128d sum1 = _mm_setzero_pd();
  __m128d sum2 = _mm_setzero_pd();
  __m128d sum3 = _mm_setzero_pd();
  __m128d v0,v1,v2,v3;

  int i;
  for(i=0;i<n;i+=8){
    v0 = _mm_load_pd(&(ad[i%256]));
    v1 = _mm_load_pd(&(ad[(i+2)%256]));
    v2 = _mm_load_pd(&(ad[(i+4)%256]));
    v3 = _mm_load_pd(&(ad[(i+6)%256]));
    sum0 = _mm_add_pd(sum0,v0);
    sum1 = _mm_add_pd(sum1,v1);
    sum2 = _mm_add_pd(sum2,v2);
    sum3 = _mm_add_pd(sum3,v3);
  }
}
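Note that in a real kernel you would combine the partial accumulators after the loop and store or return the result, which also keeps the compiler from discarding the work as dead code. A minimal sketch of that final reduction (the reduceSums name and this step are mine, not part of the benchmark):

#include <immintrin.h>

double reduceSums(__m128d sum0, __m128d sum1, __m128d sum2, __m128d sum3){
  double tmp[2] __attribute__((aligned(16)));

  // collapse the four accumulators into one register...
  __m128d total = _mm_add_pd(_mm_add_pd(sum0,sum1), _mm_add_pd(sum2,sum3));

  // ...then add its two lanes together in scalar code
  _mm_store_pd(tmp, total);
  return tmp[0] + tmp[1];
}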

And the benchmark result goes as follows

Processing 1000000000 iterations
Elapsed time LoopA0 3783 milliseconds!
Elapsed time LoopA2 2359 milliseconds!
Elapsed time LoopA3 1418 milliseconds!

Again a significant improvement. And is that all? Not really: we can replicate the two previous techniques with the AVX instruction set, which gives us the same ideas but over 256-bit registers (this is also why the buffer needs to be 32-byte aligned, since _mm256_load_pd expects 32-byte aligned addresses). The two following functions do just that

void loopA4(int n){
  __m256d sum = _mm256_setzero_pd(); // accumulator must be initialized
  __m256d v;

  int i;
  for(i=0;i<n;i+=4){
    v = _mm256_load_pd(&(ad[i%256])); // load four doubles at once
    sum = _mm256_add_pd(sum,v);
  }
}


void loopA5(int n){
  // four independent 256-bit accumulators, again to expose instruction-level parallelism
  __m256d sum0 = _mm256_setzero_pd();
  __m256d sum1 = _mm256_setzero_pd();
  __m256d sum2 = _mm256_setzero_pd();
  __m256d sum3 = _mm256_setzero_pd();
  __m256d v0,v1,v2,v3;

  int i;
  for(i=0;i<n;i+=16){
    v0 = _mm256_load_pd(&(ad[i%256]));
    v1 = _mm256_load_pd(&(ad[(i+4)%256]));
    v2 = _mm256_load_pd(&(ad[(i+8)%256]));
    v3 = _mm256_load_pd(&(ad[(i+12)%256]));
    sum0 = _mm256_add_pd(sum0,v0);
    sum1 = _mm256_add_pd(sum1,v1);
    sum2 = _mm256_add_pd(sum2,v2);
    sum3 = _mm256_add_pd(sum3,v3);
  }
}
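One practical caveat: these last two functions will crash with an illegal-instruction fault on CPUs without AVX, so production code usually guards them. A possible guard using GCC's __builtin_cpu_supports (the dispatch function here is just an illustration, not part of the benchmark):

void runBestLoop(int n){
  if(__builtin_cpu_supports("avx")){
    loopA5(n);   // 256-bit AVX path
  } else {
    loopA3(n);   // fall back to the 128-bit SSE2 path
  }
}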

And the final benchmark results are as follows

Elapsed time LoopA0 3834 milliseconds!
Elapsed time LoopA2 2360 milliseconds!
Elapsed time LoopA3 1421 milliseconds!
Elapsed time LoopA4 911 milliseconds!
Elapsed time LoopA5 727 milliseconds!

So we managed to go from almost 4 seconds down to less than a second just with instruction-level optimizations.

But you might be thinking: well, this is pretty nice, but I'm a Java developer and in Java we don't have all this mumbo jumbo wizardry.

Well, maybe things are not quite like that. You know, Java is not just bytecode running under a process we call the JVM. Current JVMs also have optimization mechanisms, and one of them is the JIT compiler. The interesting point here is that the JIT also does some wizardry. As an example, we will look at a JIT technique called method in-lining, which is nothing more than replacing a method call with the body of the method when it pays off. First question: how do I know this stuff is really happening? Well, with a proof of concept. So let's consider the following Java program, which is pretty silly, I know, but nevertheless:

public class JavaInlining {
  private static final void ComputeStuff(int a){
    int b = a*2;
  }
  public static void main(String args[]){
    for(int i=0;i<100;i++){
      ComputeStuff(1);
    }
  }
}

Let's compile it and disassemble the resulting bytecode

javac JavaInlining.java
javap -c JavaInlining


Compiled from "JavaInlining.java"
public class JavaInlining {
  public JavaInlining();
    Code:
       0: aload_0       
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return        

  public static void main(java.lang.String[]);
    Code:
       0: iconst_0      
       1: istore_1      
       2: iload_1       
       3: bipush        100
       5: if_icmpge     18
       8: iconst_1      
       9: invokestatic  #2                  // Method ComputeStuff:(I)V
      12: iinc          1, 1
      15: goto          2
      18: return        
}

Notice that we got an invokestatic that calls the ComputeStuff method. This indirection has a performance cost, so one technique the JVM uses is to replace the call with inlined compiled code. But the JIT is clever enough not to inline everything, which would end up hurting performance for a different reason (the generated code would grow too much). What the JIT does is figure out whether a method is invoked often enough to matter; if so, it is tagged as HOT and its calls are replaced by inline code.

If we run

java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining JavaInlining | grep Stuff

we notice that nothing is printed, and that is because we only ran the loop for 100 iterations, which is not enough to make the method HOT. Now let's replace 100 with 1000000 and check the output again

java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining JavaInlining | grep Stuff
     47   16       3       JavaInlining::ComputeStuff (5 bytes)
     47   17       1       JavaInlining::ComputeStuff (5 bytes)
     47   16       3       JavaInlining::ComputeStuff (5 bytes)   made not entrant
                              @ 9   JavaInlining::ComputeStuff (5 bytes)
                              @ 9   JavaInlining::ComputeStuff (5 bytes)
                              @ 9   JavaInlining::ComputeStuff (5 bytes)   inline (hot)

By increasing the loop to a million iterations the JIT compiler notices that ComputeStuff is a method worth optimizing and replaces the indirect call with inline code. The point here is that even if you don't need these techniques in your day-to-day work, it is important to be aware of them so you can write code in a way that lets your virtual machine, for instance, apply these optimizations.