Task 1: memory access time

As you may know, sometimes the execution time of a program is dominated by the program’s memory accesses. A typical example is that of looping through the elements of an array, in C (ignore BLOCK_SIZE for now):

#include <emmintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#define SIZE 10
#define BLOCK_SIZE 64

uint8_t array[SIZE * BLOCK_SIZE];

int main ()
{
    int data = 0;
    volatile uint8_t *addr;
    int i;

    for (i = 0; i < SIZE; i++)
        array[i * BLOCK_SIZE]=1;

    for (i = 0; i < SIZE; i++)
    {
        addr = &array[i * BLOCK_SIZE];
        data = *addr;
        /* Do something */
    }

    return 0;
}

Explanation:

  • In the first for loop the program initialises the array cells i * BLOCK_SIZE with  value 1;
  • In the second loop, the program access those cells. Notice that addr = &array[i * BLOCK_SIZE] computes the address of the cell i * BLOCK_SIZE which is subsequently read by dereferencing the address in data = *addr.

Measuring time in C programs

C programs can access the processor’s timestamp counter calling the function __rdtscp(&AUX) from the header x86intrin.h that returns a 64-bit unsigned integer with the value of the said timestamp. For example, you can get the current CPU timestamp as follows:

unsigned int junk = 0; // This is used as a throw-away variable
register uint64_t timestamp;

timestamp = __rdtscp(&junk);
// Do something with the timestamp

The following program is the same as the previous one: it initialises some array cells in the first for loop and it accesses them in the second for loop. Your task it to create a new file task1.c (use nano task1.c to create the file and edit it), paste the C code below and complete it using __rdtscp to measure and report the number of cycles taken by each access to *addr (see the comments):

#include <emmintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#define SIZE 10
#define BLOCK_SIZE 64

uint8_t array[SIZE * BLOCK_SIZE];

int main()
{
    unsigned int junk = 0;
    register uint64_t start, delta, sum = 0;
    volatile uint8_t *addr;
    int i;

    // Bring the array to memory
    for (i = 0; i < SIZE; i++)
        array[i * BLOCK_SIZE] = 1;

    for (i = 0; i < SIZE; i++)
    {
        addr = &array[i * BLOCK_SIZE];
        start = /* Complete me: measure the timestamp here (use junk as a throw-away) */;
        junk = *addr;
        delta = /* Complete me: measure the timestamp and compute the difference w.r.t. start here (use junk as a throw-away) */;
        sum += delta;
        printf("Access time for array[%d*4096]: %ld CPU cycles\n", i, delta);
    }

    printf("Avg. access time: %ld CPU cycles\n", sum/SIZE);
    return 0;
}

As explained in the prerequisites you can compile the program with gcc task1.c -o task1 -static and then execute it in the gem5 simulator with simulate task1.  The advantage of using gem5 is that it makes execution deterministic since no other programs interfere with task1 and the actual architectural details of your host machine do not matter.

Notice that the time taken for each access is constant except for the first access. This is due to the fact that code also gets into the cache and this interferes with our experiment. In particular, when the junk = *addr; code is not yet loaded in the cache, this will increase the access time of the first iteration.

Your task it to “pad” the code so that this interference disappears. The aim is to make  junk = *addr code belong to the same cache block as the start=... code so that they are loaded together. A way to do this is to simply add some redundant accesses array[0 * BLOCK_SIZE] = 1; between the two for loops. The minimum number of these redundant accesses that makes the access time constant is the password for task 2.

When ready, go ahead with Task 2.