Task 1: memory access time

As you may know, sometimes the execution time of a program is dominated by the program’s memory accesses. A typical example is that of looping through the elements of an array, in C (ignore BLOCK_SIZE for now):

#include <emmintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#define SIZE 10
#define BLOCK_SIZE 4096

uint8_t array[SIZE * BLOCK_SIZE];

int main ()
{
    int data = 0;
    volatile uint8_t *addr;
    int i;

    for (i = 0; i < SIZE; i++)
        array[i * BLOCK_SIZE]=1;

    for (i = 0; i < SIZE; i++)
    {
        addr = &array[i * BLOCK_SIZE];
        data = *addr;
        /* Do something */
    }

    return 0;
}

Explanation:

  • In the first for loop the program initialises the array cells i * BLOCK_SIZE with  value 1;
  • In the second loop, the program access those cells. Notice that addr = &array[i * BLOCK_SIZE] computes the address of the cell i * BLOCK_SIZE which is subsequently read by dereferencing the address in data = *addr.

Measuring time in C programs

C programs can access the processor’s timestamp counter calling the function __rdtscp(&AUX) from the header x86intrin.h that returns a 64-bit unsigned integer with the value of the said timestamp. For example, you can get the current CPU timestamp as follows:

unsigned int junk = 0; // This is used as a throw-away variable
register uint64_t timestamp;

timestamp = __rdtscp(&junk);
// Do something with the timestamp

The following program is the same as the previous one: it initialises some array cells in the first for loop and it accesses them in the second for loop. Your task it to create a new file task1.c (use nano task1.c to create the file and edit it), paste the C code below and complete it using __rdtscp to measure and report the number of cycles taken by each access to *addr (see the comments):

#include <emmintrin.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#define SIZE 10
#define BLOCK_SIZE 4096

uint8_t array[SIZE * BLOCK_SIZE];

int main()
{
    unsigned int junk = 0;
    register uint64_t start, delta, sum = 0;
    volatile uint8_t *addr;
    int i;

    // Bring the array to memory
    for (i = 0; i < SIZE; i++)
        array[i * BLOCK_SIZE] = 1;

    for (i = 0; i < SIZE; i++)
    {
        addr = &array[i * BLOCK_SIZE];
        start = /* Complete me: measure the timestamp here (use junk as a throw-away) */;
        junk = *addr;
        delta = /* Complete me: measure the timestamp and compute the difference w.r.t. start here (use junk as a throw-away) */;
        sum += delta;
        printf("Access time for array[%d*4096]: %ld CPU cycles\n", i, delta);
    }

    printf("Avg. access time: %ld CPU cycles\n", sum/SIZE);
    return 0;
}

As explained in the prerequisites you can compile the program with gcc task1.c -o task1 -static and then execute it in the gem5 simulator with simulate task1.  The advantage of using gem5 is that it makes execution deterministic since no other programs interfere with task1 and the actual architectural details of your host machine do not matter.

  1. Is the time taken for each access relatively constant?
  2. Observe the average number of cycles taken by memory accesses.
Take note of the answers to these two questions (you’ll use them later). The password for the next task is the right-hand side of the assignment to delta with no whitespace and no semicolon.
When ready, go ahead with Task 2.