Sunday, December 28, 2014

Random notes from slogging through some performance issues

OK, so over the last year I've occasionally been spending time tuning the performance of a multi-core application. This can be a huge timesink. What works best for me is to gather data, try some obvious changes, then get away from the computer and stew on the problem for a bit.

Obviously, in the multi-core world the main goal is to keep scaling as more cores are thrown at a problem. In practice, that has meant:

  1. Avoid locking of any kind, otherwise performance won't scale as more cores are thrown into the stewpot
  2. Minimize cache misses and reloads of hot cache lines; improve cache locality
  3. Old-fashioned instruction tweaking (i.e. reducing instruction costs)


The above are listed in their approximate order of importance.
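
To make points 1 and 2 a bit more concrete, here is a minimal sketch in C. It is not taken from the application above. Each thread bumps its own counter instead of sharing one behind a lock, and each counter is aligned to its own cache line so that threads don't keep invalidating each other's lines (false sharing). The 64-byte line size, MAX_THREADS, and all of the names are assumptions made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stdlib.h>

#define CACHE_LINE  64   /* assumed cache line size; typical on x86-64 */
#define MAX_THREADS 64   /* arbitrary upper bound for this sketch */

/* One counter per thread, aligned so each sits on its own cache line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

struct worker_arg {
    struct padded_counter *slot;  /* this thread's private counter */
    long iterations;
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    for (long i = 0; i < arg->iterations; i++)
        arg->slot->value++;       /* private slot: no lock, no atomic op */
    return NULL;
}

/* Runs nthreads workers and returns the sum of their private counters. */
long run_sharded_count(int nthreads, long iterations)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    struct padded_counter slots[MAX_THREADS] = {0};
    struct worker_arg args[MAX_THREADS];
    pthread_t tids[MAX_THREADS];

    for (int i = 0; i < nthreads; i++) {
        args[i].slot = &slots[i];
        args[i].iterations = iterations;
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            abort();
    }

    long total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);  /* join makes the slot safe to read */
        total += slots[i].value;
    }
    return total;
}
```

Calling run_sharded_count(4, 100000) returns 400000 (build with -pthread). The only synchronization is the join at the end, so the hot loop never touches a line that another core is writing.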

I highly recommend watching the videos listed on this posting, as they point out that #2 is often more important than #3 in performance tweaking.

Locking can often be avoided by using userspace RCU, or similar tricks.
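
As a rough sketch of the idea behind RCU (this is not the liburcu API, just C11 atomics): readers load a pointer to the current version of some data without taking a lock, and a writer copies the data, changes the copy, and publishes it with a single atomic store. The struct, the function names, and the single-writer assumption are all invented for illustration. The hard part of real RCU, knowing when the old version can safely be freed, is left to the caller here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical read-mostly data shared between threads. */
struct config {
    int timeout_ms;
    int max_conns;
};

/* Pointer to the current version; starts out NULL. */
static _Atomic(struct config *) current_config;

/* Reader side: one atomic load, no lock. Returns -1 if nothing is published. */
int read_timeout(void)
{
    struct config *c =
        atomic_load_explicit(&current_config, memory_order_acquire);
    return c ? c->timeout_ms : -1;
}

/*
 * Writer side (assumes a single writer thread): copy the current version,
 * modify the copy, then publish it. Returns the previous version, which
 * must not be freed until no reader can still be using it.
 */
struct config *publish_timeout(int timeout_ms)
{
    assert(timeout_ms >= 0);

    struct config *old =
        atomic_load_explicit(&current_config, memory_order_relaxed);
    struct config *next = malloc(sizeof *next);
    if (next == NULL)
        abort();

    if (old != NULL)
        *next = *old;                  /* copy ... */
    else
        *next = (struct config){0};
    next->timeout_ms = timeout_ms;     /* ... update ... */

    /* ... and publish: release pairs with the readers' acquire load. */
    atomic_store_explicit(&current_config, next, memory_order_release);
    return old;
}
```

With userspace RCU proper, the wait before freeing the old version is what synchronize_rcu() provides; the sketch above only shows why the read path needs no lock.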

 Other great resources:

  Performance bit twiddling
  Awesome parallel programming reference
  Detailed Assembly/C/C++ x86 Optimizations

One of the great tools is simply perf top. A great deal of insight can be gained just by looking at what the command below produces:

 sudo /usr/bin/perf top -p <pid>

Pretty much any hardware or software event that perf supports can be profiled (perf list shows what is available, and -e selects an event), but by default the counts shown are samples per function.

There are a ton of tools out there to help evaluate performance. Just make sure that you understand how the data is being captured and presented, otherwise you risk getting sucked down the rabbit-hole of false assumptions...