So I'm back.
I wrote the little program below to test the cost of fetching values from lookup tables of various sizes. Presumably the cost would show up as a performance difference once the lookup table exceeds the size of whichever cache it would otherwise fit in.
So, my little experiment went as follows:
Size of LUT:    Clock ticks:
10k             100k
100k            110k
1m              90k
10m             760k
100m            1000k
1g              1800k
From end to end there's a roughly 20x performance difference for the same number of operations. Presumably this is the difference between the 10k, 100k, and 1m tables fitting comfortably in cache while the larger ones have to come from RAM. I'm running on a Haswell 4770S, which has 256KB of L1, 1MB of L2, and 8MB of L3. Since the lookup table holds uint32_t values (4 bytes each), multiply the entry count by 4 to get the memory footprint: the 1m table is 4MB and still fits in the 8MB L3, while the 10m table is 40MB and spills out to RAM.
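As a sanity check on that arithmetic, here's a minimal sketch that computes the working set for each table size and guesses where it lands. It assumes 4 bytes per entry and the chip-total cache sizes quoted above, lumping L1 and L2 together and ignoring that they're per-core:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    //chip-total cache sizes for the 4770S as quoted above (rough cutoffs)
    const uint64_t l2 = 1ULL << 20; //1MB L2
    const uint64_t l3 = 8ULL << 20; //8MB L3
    const uint64_t entries[] = {10000, 100000, 1000000,
                                10000000, 100000000, 1000000000};
    for (int i = 0; i < 6; ++i) {
        uint64_t bytes = entries[i] * sizeof(uint32_t); //4 bytes per entry
        const char *level = bytes <= l2 ? "L1/L2"
                          : bytes <= l3 ? "L3"
                          : "RAM";
        printf("%10llu entries -> %10llu bytes -> %s\n",
               (unsigned long long)entries[i],
               (unsigned long long)bytes, level);
    }
    return 0;
}

The rows where this prints "RAM" are exactly the rows where the clock ticks jump, which fits the story.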
Interesting (but not surprising) with the 20x performance differential when abusing memory.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    uint32_t *tbl;
    //size of table below, i.e. 1G
    uint32_t size = 1000 * 1000 * 1000;
    tbl = calloc(size, sizeof *tbl);
    if (!tbl) {
        perror("calloc");
        return 1;
    }
    //set roughly half the entries to 1 so the sum isn't trivially known
    uint32_t i;
    for (i = 0; i < size; ++i) {
        if (rand() % 2)
            tbl[i] = 1;
    }
    uint32_t ct = 100000000;
    //start timing here
    int accum = 0;
    clock_t start, end;
    start = clock();
    while (ct) {
        accum += tbl[ct % size];
        --ct;
    }
    //done with timing here
    end = clock();
    printf("done: %ld\n", (long)(end - start));
    printf("really done: %d\n", accum);
    free(tbl);
    return 0;
}
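If you want to try this yourself, something like gcc -O1 lut.c && ./a.out should do (the filename is just a placeholder); since accum feeds the final printf, the compiler can't drop the lookup loop as dead code. Note that the 1g configuration allocates about 4GB, so shrink size if you're short on RAM.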