Sunday, December 28, 2014

Random bit slogging notes through some performance issues

OK, so over the last year I've occasionally been spending time tweaking the performance of a multi-core application. This can be a huge timesink. What works best for me is to gather data, try some obvious changes, then get away from the computer and stew on the problem for a bit.

Obviously in the multi-core world, the overriding goal is to keep scaling as more cores are thrown at a problem. That means performance tweaking requires:

  1. Avoid locking of any kind, otherwise performance won't scale as more cores are thrown into the stewpot
  2. Minimize cache misses and hot cache-line reloads, and improve cache locality (see the sketch just below)
  3. Old-fashioned instruction tweaking (i.e. reducing instruction costs).


The above are listed in their approximate order of importance.
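
As a tiny illustration of how #1 and #2 interact, here's a minimal sketch (the names and the 64-byte line size are my assumptions, not from any particular codebase) of padding per-core counters so that two cores never bounce the same cache line between them--the classic false-sharing trap:

#include <stdalign.h>  /* C11 alignas */

#define MAX_CORES 64
#define CACHE_LINE 64  /* assumed: 64-byte lines, true of most x86 parts */

/* One counter per core, each alone on its own cache line. Without the
   alignment, neighboring cores' counters share a line and every
   increment forces that line to ping-pong between cores. */
struct percore_counter {
    alignas(CACHE_LINE) unsigned long count;
};

static struct percore_counter counters[MAX_CORES];

static inline void count_event(int core)
{
    counters[core].count++;   /* no lock needed: one writer per slot */
}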

I highly recommend watching the videos referenced in this post, as they point out that #2 is often more important than #3 in performance tweaking.

Locking can often be avoided by using userspace RCU, or similar tricks.
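
For example, here's a minimal read-side/update-side sketch against liburcu's classic API (urcu.h, link with -lurcu); the config struct and function names are placeholders I made up for illustration, and it assumes a single updater:

#include <urcu.h>      /* userspace RCU */
#include <stdlib.h>

struct config { int ttl; };
static struct config *live_cfg;   /* shared, RCU-protected pointer */

/* Readers take no lock--just a cheap read-side critical section. */
int read_ttl(void)
{
    rcu_read_lock();
    struct config *c = rcu_dereference(live_cfg);
    int ttl = c ? c->ttl : 0;
    rcu_read_unlock();
    return ttl;
}

/* Writer publishes a new copy, waits out old readers, then reclaims. */
void update_ttl(int ttl)
{
    struct config *fresh = malloc(sizeof(*fresh));
    struct config *old = live_cfg;  /* safe: single updater assumed */
    fresh->ttl = ttl;
    rcu_assign_pointer(live_cfg, fresh);
    synchronize_rcu();              /* all pre-existing readers done */
    free(old);
}

(Each reader thread also has to call rcu_register_thread()/rcu_unregister_thread(), omitted here for brevity.)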

Other great resources:

  Performance bit twiddling
  Awesome parallel programming reference
  Detailed Assembly/C/C++ x86 Optimizations

Obviously one of the great tools is just running perf top; a great deal of insight can be gained simply by looking at the results the command below produces:

 sudo /usr/bin/perf top -p <pid>

Pretty much any kind of hardware/software-supported event can be profiled, but by default perf samples CPU cycles and reports the sample counts per function.
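
Events other than the default cycles counter are selected with -e; for example, to sample cache misses instead (event names vary by kernel and CPU--run perf list to see what your box offers):

 sudo /usr/bin/perf top -e cache-misses -p <pid>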

There are a ton of tools out there to help evaluate performance--just make sure you understand how the data is being captured and presented; otherwise you risk getting sucked down the rabbit-hole of false assumptions...

Saturday, May 31, 2014

Awesome! videos on modern CPU performance optimizations

Wow--I finally ended up watching these after sitting on these links for a while. They are just what the doctor ordered if you have questions about low-level source code performance optimizations in modern processors.

Questions on SSE/AVX, pipelining, cache fetch times, memory access, locality of memory access, etc. are addressed in these talks. What is fantastic (and repeatedly driven home) is that performance is not necessarily about reducing the overall instruction count, but about reducing memory/cache access times (and how to do this).
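
To make that point concrete, here's a minimal sketch (the matrix size and names are mine, purely illustrative): both functions do identical arithmetic on the same data, but the row-order walk touches memory sequentially while the column-order walk strides 16KB between accesses, and on a matrix this size the difference in runtime is dramatic.

#define N 2048
static double m[N][N];

/* Cache-friendly: visits memory in the order it is laid out. */
double sum_by_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];
    return s;
}

/* Cache-hostile: same work, but each access jumps N * 8 bytes ahead. */
double sum_by_cols(void)
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];
    return s;
}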


http://channel9.msdn.com/Events/Build/2014/4-587
http://channel9.msdn.com/Events/Build/2013/4-329


By the way--you can disregard the Microsoft provenance--most of the discussion/techniques apply equally to any modern x86 compiler.

Jo bob says watch'em!

Saturday, April 19, 2014

Computing changed bits in a range of values

So, there's a reason to do this, other than just wasting CPU cycles. I need to take in a range of two-byte values and compute a mask of all the bits that change between the lower and upper values of that range.

The underlying reason is that this allows for a quick match against network packet port ranges. So below I have a little test app that computes the changed bits given arbitrary start and stop values.

This ends up producing the following output:

slioch@slioch-All-Series:~/vyatta$ ./range 32076 62199
xxxxxxxxxxxxxxxx
slioch@slioch-All-Series:~/vyatta$ ./range 52076 62199
xxxxxxxxxxxxxx--
slioch@slioch-All-Series:~/vyatta$ ./range 62076 62199
xxxxxxxx--------
slioch@slioch-All-Series:~/vyatta$ ./range 1 2
xx--------------
slioch@slioch-All-Series:~/vyatta$ ./range 1 3
xx--------------
slioch@slioch-All-Series:~/vyatta$ ./range 1 4
xxx-------------
slioch@slioch-All-Series:~/vyatta$ ./range 1 1024
xxxxxxxxxxx-----
slioch@slioch-All-Series:~/vyatta$ ./range 1 1023
xxxxxxxxxx------

The "x" represents a bit position that changes, while the "-" represents a position that doesn't change. This ends up allowing for a quick comparison against a start-stop (or range) of values. Note that the representation above has the LSB (least signficant bit) on the left.



#include <stdio.h>
#include <stdlib.h>


#define NUM_BITS_FOUR_BYTES 32

/* Print the low 'num' bits of the word(s) at x, least-significant bit
   first: 'x' marks a bit that is set, '-' one that is clear. Advances
   to the next 32-bit word every NUM_BITS_FOUR_BYTES bits. */
static void
byte_to_binary(unsigned int *x, int num)
{
  unsigned int mask = 0x01;
  int z;
  for (z = 0; z < num; ++z) {
    printf(((*x & mask) == mask) ? "x" : "-");
    mask = mask << 1;
    if (z % NUM_BITS_FOUR_BYTES == (NUM_BITS_FOUR_BYTES - 1)) {
      mask = 0x01;
      x++;
    }
  }
}


/*
   Calculate fixed/wildcard bit positions in a range of numbers
   in binary representation.
*/
int main(int argc, char **argv)
{
  if (argc < 3) {
    printf("need start and end\n");
    return 1;
  }

  unsigned long start = strtoul(argv[1], NULL, 10);
  unsigned long end = strtoul(argv[2], NULL, 10);
  if (end < start) {
    printf("Error in range\n");
    exit(1);
  }

  //now compute all the bits changed in this range
  unsigned int i, accum = start ^ end;
  if (accum) {
    //every bit below the highest differing bit can change in the range
    unsigned int loc = 32 - __builtin_clz(accum);
    for (i = 0; i < loc; ++i)
      accum |= 0x1u << i;
  }
  byte_to_binary(&accum, 16);
  printf("\n");

  return 0;
}

Wednesday, April 16, 2014

A little fun with L1/L2/L3 cache RAM fetch speed

I know it has been a while since I've posted. And that's pretty much a reflection of being super busy as well as not having a little nugget at the tips of my fingers to post.

So I'm back.

I wrote the little program below to test the cost of fetching values from lookup tables of various sizes. Presumably this would show up as performance differences once the lookup table exceeds the size of the various caches it could otherwise fit within.

So, my little experiment went as follows: