As you may know, sometimes the execution time of a program is dominated by the program’s memory accesses. A typical example is that of looping through the elements of an array, in C (ignore BLOCK_SIZE
for now):
#include <emmintrin.h> #include <x86intrin.h> #include <stdint.h> #include <stdio.h> #define SIZE 10 #define BLOCK_SIZE 64 uint8_t array[SIZE * BLOCK_SIZE]; int main () { int data = 0; volatile uint8_t *addr; int i; for (i = 0; i < SIZE; i++) array[i * BLOCK_SIZE]=1; for (i = 0; i < SIZE; i++) { addr = &array[i * BLOCK_SIZE]; data = *addr; /* Do something */ } return 0; }
Explanation:
- In the first
for
loop the program initialises the array cellsi * BLOCK_SIZE
with value1
; - In the second loop, the program access those cells. Notice that
addr = &array[i * BLOCK_SIZE]
computes the address of the celli * BLOCK_SIZE
which is subsequently read by dereferencing the address indata = *addr
.
Measuring time in C programs
C programs can access the processor’s timestamp counter calling the function __rdtscp(&AUX)
from the header x86intrin.h
that returns a 64-bit unsigned integer with the value of the said timestamp. For example, you can get the current CPU timestamp as follows:
unsigned int junk = 0; // This is used as a throw-away variable register uint64_t timestamp; timestamp = __rdtscp(&junk); // Do something with the timestamp
The following program is the same as the previous one: it initialises some array cells in the first for
loop and it accesses them in the second for
loop. Your task it to create a new file task1.c
(use nano task1.c
to create the file and edit it), paste the C code below and complete it using __rdtscp
to measure and report the number of cycles taken by each access to *addr
(see the comments):
#include <emmintrin.h> #include <x86intrin.h> #include <stdint.h> #include <stdio.h> #define SIZE 10 #define BLOCK_SIZE 64 uint8_t array[SIZE * BLOCK_SIZE]; int main() { unsigned int junk = 0; register uint64_t start, delta, sum = 0; volatile uint8_t *addr; int i; // Bring the array to memory for (i = 0; i < SIZE; i++) array[i * BLOCK_SIZE] = 1; for (i = 0; i < SIZE; i++) { addr = &array[i * BLOCK_SIZE]; start = /* Complete me: measure the timestamp here (use junk as a throw-away) */; junk = *addr; delta = /* Complete me: measure the timestamp and compute the difference w.r.t. start here (use junk as a throw-away) */; sum += delta; printf("Access time for array[%d*4096]: %ld CPU cycles\n", i, delta); } printf("Avg. access time: %ld CPU cycles\n", sum/SIZE); return 0; }
As explained in the prerequisites you can compile the program with gcc task1.c -o task1 -static
and then execute it in the gem5 simulator with simulate task1
. The advantage of using gem5 is that it makes execution deterministic since no other programs interfere with task1
and the actual architectural details of your host machine do not matter.
Notice that the time taken for each access is constant except for the first access. This is due to the fact that code also gets into the cache and this interferes with our experiment. In particular, when the junk = *addr;
code is not yet loaded in the cache, this will increase the access time of the first iteration.
Your task it to “pad” the code so that this interference disappears. The aim is to make junk = *addr
code belong to the same cache block as the start=...
code so that they are loaded together. A way to do this is to simply add some redundant accesses array[0 * BLOCK_SIZE] = 1;
between the two for
loops. The minimum number of these redundant accesses that makes the access time constant is the password for task 2.