Java, NUMA and Memory Boundaries
Benchmarking Memory
Caches and NUMA
Today's post is all about tinkering with memory. These experiments were inspired by an investigation of a friend/coworker/geek_soldier into memory operations. His investigation is broader and his examples are certainly more complex. Here we will tackle the simplest of the memory analysis problems: crossing the boundaries between the L1, L2 and L3 caches. Most modern CPU architectures expose some form of NUMA (Non-Uniform Memory Access): the cost of reaching a piece of data depends on where it sits relative to the core that needs it. The cache hierarchy exploits the phenomenon of data locality in the same spirit. In short, L1, L2 and L3 are three levels of cache.
L1 is the closest to the CPU and L3 the farthest, and beyond that, of course, we have main memory. To be able to tinker with all these kinds of memory we first need to identify them on our machine. For that, assuming you are using a Linux distribution, you can use lscpu. For instance, on my laptop I get the following description
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 2105.593
BogoMIPS: 5188.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
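The same information can also be read straight from sysfs, if you prefer to script it (the path below is present on most modern Linux kernels; the index entries map to L1d, L1i, L2 and L3):

cat /sys/devices/system/cpu/cpu0/cache/index*/size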
The experiment
The whole purpose of today's experiment is to answer a very simple question: how do these memories impact performance? We know that in terms of access latency the following relationship holds: L1 < L2 < L3 < main memory, with L1 the fastest to reach and main memory the slowest.
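For a sense of scale, commonly cited ballpark figures for Intel CPUs of this generation (rough orders of magnitude, not measured on this machine) are:
- L1 hit: ~4 cycles
- L2 hit: ~12 cycles
- L3 hit: ~30-40 cycles
- main memory: ~100+ cycles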
Benchmarking with JMH
To benchmark these low-level constructs we need the help of a programming language and a benchmarking framework. We could use C, with compiler directives to align memory and loops to iterate over it. But for this experiment we don't really need a low-level language like C to uncover the profile of these four types of memory. We can, and will, use Java. We just need a framework to help us with the measurement, and for that the choice is the industry standard, JMH. This tool is mature and very easy to tinker with, so it is a natural choice for this kind of work. The next step is just the formalization of these ideas.
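If you want to reproduce the setup, the easiest way to bootstrap a JMH project is the official Maven archetype (the coordinates below are the standard ones; the groupId and artifactId of the generated project are placeholders):

mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=com.example \
    -DartifactId=cache-benchmarks \
    -Dversion=1.0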
Swap my letter
To sum up all these ideas we create a simple letter-swap algorithm that pseudo-randomly swaps letters in a byte[]. The idea is to create several byte[] structures of the following forms
- byte[] living only in L1 cache
- byte[] crossing boundaries between L1 and L2
- byte[] crossing boundaries between L1, L2 and L3
- byte[] crossing boundaries between L1, L2, L3 and main memory
Since the letter swapping is somewhat random, whenever a byte[] crosses a cache boundary a latency penalty should be noticeable. For that we devise this naive, but hopefully sufficient, algorithm.
private void shuffleLetters(byte[] letters, int iterations) {
    int p1, p2;
    byte aux;
    for (int i = 0; i < iterations; i++) {
        // Stride through the array with a prime step, wrapping around with
        // the modulo; the long cast avoids int overflow (and a negative
        // index) once PRIME_NUMBER * i grows past Integer.MAX_VALUE.
        p1 = (int) ((PRIME_NUMBER * (long) i) % letters.length);
        p2 = (int) ((PRIME_NUMBER * (long) (i + 1)) % letters.length);
        // Classic three-step swap of the two letters.
        aux = letters[p1];
        letters[p1] = letters[p2];
        letters[p2] = aux;
    }
}
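To see what access pattern the prime stride actually produces, here is a throwaway sketch (the class name, the main method and the 16 KB length are just for illustration; they are not part of the benchmark):

// Prints the first few index pairs the prime stride produces for a
// 16 KB array (half the L1 size used below).
class StrideDemo {
    public static void main(String[] args) {
        int length = 1024 * 16;
        for (int i = 0; i < 5; i++) {
            int p1 = (int) ((7919L * i) % length);
            int p2 = (int) ((7919L * (i + 1)) % length);
            System.out.println(p1 + " <-> " + p2);
        }
    }
}

The indices jump 7919 positions at a time, which keeps consecutive accesses far apart in memory.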
We choose a prime number to avoid collisions between positions caused by the remainder of the division; this is just a nice property that comes directly from number theory, more specifically from the theory of prime numbers. Aside from that, this is a pretty simple algorithm to shuffle the letters inside the array. Next we need to define the sizes of the arrays. For that we take into account the lscpu output we got previously and build the following properties
private static final int MAX_ITERATIONS = 1000 * 1000 * 500; // max number of iterations
private static final int PRIME_NUMBER = 7919;                // prime used for the shuffling stride
private static final int L1_CACHE_SIZE = 1024 * 32;          // 32 kilobytes
private static final int L2_CACHE_SIZE = 1024 * 256;         // 256 kilobytes
private static final int L3_CACHE_SIZE = 1024 * 4096;        // 4 megabytes
private static final int RAM_SEGMENT = 1024 * 1024 * 10;     // 10 megabytes

// Each array is half the size of its target level, so it fits in that
// level but is too big for the level below it.
private static final byte[] LETTERS_L1 = populateByteArray(L1_CACHE_SIZE / 2);
private static final byte[] LETTERS_L2 = populateByteArray(L2_CACHE_SIZE / 2);
private static final byte[] LETTERS_L3 = populateByteArray(L3_CACHE_SIZE / 2);
private static final byte[] LETTERS_RAM_SEGMENT = populateByteArray(RAM_SEGMENT / 2);
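The populateByteArray helper is referenced but never shown in the post. A minimal sketch of what it could look like (the repeating a-z filler is an assumption; for this benchmark only the bytes' location matters, not their values):

private static byte[] populateByteArray(int size) {
    byte[] letters = new byte[size];
    for (int i = 0; i < size; i++) {
        letters[i] = (byte) ('a' + (i % 26)); // repeating a-z pattern
    }
    return letters;
}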
And finally we just need to create the four benchmarks previously defined, named to match the result tables below

@Benchmark
public void l1CacheSizeRandom() {
    shuffleLetters(LETTERS_L1, MAX_ITERATIONS);
}

@Benchmark
public void l2CacheSizeRandom() {
    shuffleLetters(LETTERS_L2, MAX_ITERATIONS);
}

@Benchmark
public void l3CacheSizeRandom() {
    shuffleLetters(LETTERS_L3, MAX_ITERATIONS);
}

@Benchmark
public void ramCacheSizeRandom() {
    shuffleLetters(LETTERS_RAM_SEGMENT, MAX_ITERATIONS);
}
We also implemented a second swapping method, which only swaps the two ends of the array. In theory this should perform worse than the random algorithm, since it always hits both ends of the array. Intuitively, we would be swapping two characters that live in different locations in memory: to read the first char, part of the array has to be copied into L1, and then the same has to happen for the last char, so every swap would force data to move between cache levels.
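The post does not list this second method, but a minimal sketch consistent with the description (the name shuffleLetterEnds is assumed) would be:

private void shuffleLetterEnds(byte[] letters, int iterations) {
    byte aux;
    for (int i = 0; i < iterations; i++) {
        // Always swap the first and the last byte of the array.
        aux = letters[0];
        letters[0] = letters[letters.length - 1];
        letters[letters.length - 1] = aux;
    }
}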
Next we just need to compile and run the program. Compilation is a traditional mvn clean install. We ran each benchmark with a warm-up of 3 iterations and a measurement run of 3 iterations, and this process was repeated in 10 forks per benchmark (hence the count of 30 in the tables below).
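With the uber-jar the JMH archetype builds, such a run would look something like this (target/benchmarks.jar is the archetype's default jar name; the flag values follow the setup just described):

java -jar target/benchmarks.jar -f 10 -wi 3 -i 3

The results follow.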
# Run complete. Total time: 00:09:07
Benchmark Mode Cnt Score Error Units
CacheBenchmark.l1CacheSizeRandom thrpt 30 1343.130 ± 68.243 ops/s
CacheBenchmark.l2CacheSizeRandom thrpt 30 1374.820 ± 35.100 ops/s
CacheBenchmark.l3CacheSizeRandom thrpt 30 1376.034 ± 34.804 ops/s
CacheBenchmark.ramCacheSizeRandom thrpt 30 1377.505 ± 35.628 ops/s
CacheBenchmark.l1CacheSizeEnds thrpt 30 7846.218 ± 10.313 ops/s
CacheBenchmark.l2CacheSizeEnds thrpt 30 7854.716 ± 10.566 ops/s
CacheBenchmark.l3CacheSizeEnds thrpt 30 7848.913 ± 10.017 ops/s
CacheBenchmark.ramCacheSizeEnds thrpt 30 7851.176 ± 12.332 ops/s
The results are pretty interesting. The intuitive interpretation turns out to be completely wrong, and the apparent worst-case scenario turns out to be the best one. Moreover, the throughput appears to be independent of the size of the array, and as a consequence independent of the location of the data in memory. A possible interpretation is that the JIT compiler and the cache hardware are clever enough to avoid memory movement, since the only data changed in the array are the first and last chars; there is no fundamental reason to copy data between caches when only two bytes are being updated at each iteration.
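One way to test such an interpretation, instead of guessing, is to ask JMH for hardware counters or for the generated assembly. On Linux, JMH ships with perf-based profilers (they assume the perf tool is installed):

java -jar target/benchmarks.jar -prof perfnorm  # hardware counters per operation, including cache misses
java -jar target/benchmarks.jar -prof perfasm   # hottest assembly regions annotated with perf events

If the two bytes really stay in registers or in a single cache line, the cache-miss counters should stay flat across the four Ends benchmarks.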
After some tweaking we arrived at this swap function

private void shuffleLettersWithSpace(byte[] letters, int iterations, int space) {
    int p1, p2;
    byte aux;
    // With space > 0, p1 stays inside the first `space` bytes while p2 stays
    // beyond them, so every swap has to cross that boundary; space == 0
    // keeps p1 within an L1-sized prefix instead.
    int offset1 = space == 0 ? L1_CACHE_SIZE : space;
    int offset2 = space == 0 ? 0 : space;
    for (int i = 0; i < iterations; i++) {
        // As before, the long cast avoids int overflow for large i.
        p1 = (int) ((PRIME_NUMBER * (long) i) % offset1);
        p2 = space + (int) ((PRIME_NUMBER * (long) (i + 1)) % (letters.length - offset2));
        aux = letters[p1];
        letters[p1] = letters[p2];
        letters[p2] = aux;
    }
}
and executed the tests with the following parametrization. The space argument is the boundary each benchmark has to cross: for the L1-sized array a space of 1 keeps both indices inside L1, while for the larger arrays the space equals the size of the previous cache level, so p1 lands in the prefix that fits the smaller cache and p2 beyond it.

@Benchmark
public void randomL1CacheSize() {
    shuffleLettersWithSpace(LETTERS_L1, MAX_ITERATIONS, 1);
}

@Benchmark
public void randomL2CacheSize() {
    shuffleLettersWithSpace(LETTERS_L2, MAX_ITERATIONS, L1_CACHE_SIZE);
}

@Benchmark
public void randomL3CacheSize() {
    shuffleLettersWithSpace(LETTERS_L3, MAX_ITERATIONS, L2_CACHE_SIZE);
}

@Benchmark
public void randomRamCacheSize() {
    shuffleLettersWithSpace(LETTERS_RAM_SEGMENT, MAX_ITERATIONS, L3_CACHE_SIZE);
}
With these last changes we were able to break the locality between swaps, and the results finally follow our initial intuition
# Run complete. Total time: 00:13:18
Benchmark Mode Cnt Score Error Units
CacheBenchmark.dummyRandomL1CacheSize thrpt 30 2969.995 ± 76.271 ops/s
CacheBenchmark.dummyRandomL2CacheSize thrpt 30 2842.415 ± 94.449 ops/s
CacheBenchmark.dummyRandomL3CacheSize thrpt 30 1840.950 ± 148.141 ops/s
CacheBenchmark.dummyRandomRamCacheSize thrpt 30 2651.015 ± 78.600 ops/s
CacheBenchmark.endsL1CacheSize thrpt 30 7886.504 ± 10.746 ops/s
CacheBenchmark.endsL2CacheSize thrpt 30 7888.736 ± 9.530 ops/s
CacheBenchmark.endsL3CacheSize thrpt 30 7885.065 ± 9.686 ops/s
CacheBenchmark.endsRamCacheSize thrpt 30 7882.311 ± 9.807 ops/s
CacheBenchmark.randomL1CacheSize thrpt 30 2797.702 ± 179.115 ops/s
CacheBenchmark.randomL2CacheSize thrpt 30 1584.330 ± 3.536 ops/s
CacheBenchmark.randomL3CacheSize thrpt 30 1087.367 ± 56.219 ops/s
CacheBenchmark.randomRamCacheSize thrpt 30 462.359 ± 4.232 ops/s
Conclusion
This exercise was a very interesting one. The several approaches used to break the locality pattern of data access showed us how complex the memory layout is and how surprising the results can be if we ignore the memory architecture. The lesson here is that we should avoid random memory access patterns whenever possible, to minimize the need to move data between cache layers. The work done here is just a heads-up on the need to take lower-level constraints, such as computer architecture, into account. Some people tend to spread the idea that high-level languages free the programmer from understanding what happens under the hood. This is a simple example pointing in the other direction: people who work with higher-level languages, like Java, should understand lower-level concepts to take their programs to a higher level.
NOTE:
- Thanks again to Luis Silva for sharing and inspiring this kindergarten experiment.