As you may know, sometimes the execution time of a program is dominated by the program’s memory accesses. A typical example is that of looping through the elements of an array, in C (ignore BLOCK_SIZE
for now):
#include <emmintrin.h> #include <x86intrin.h> #include <stdint.h> #include <stdio.h> #define SIZE 10 #define BLOCK_SIZE 4096 uint8_t array[SIZE * BLOCK_SIZE]; int main () { int data = 0; volatile uint8_t *addr; int i; for (i = 0; i < SIZE; i++) array[i * BLOCK_SIZE]=1; for (i = 0; i < SIZE; i++) { addr = &array[i * BLOCK_SIZE]; data = *addr; /* Do something */ } return 0; }
Explanation:
- In the first
for
loop the program initialises the array cellsi * BLOCK_SIZE
with value1
; - In the second loop, the program access those cells. Notice that
addr = &array[i * BLOCK_SIZE]
computes the address of the celli * BLOCK_SIZE
which is subsequently read by dereferencing the address indata = *addr
.
Measuring time in C programs
C programs can access the processor’s timestamp counter calling the function __rdtscp(&AUX)
from the header x86intrin.h
that returns a 64-bit unsigned integer with the value of the said timestamp. For example, you can get the current CPU timestamp as follows:
unsigned int junk = 0; // This is used as a throw-away variable register uint64_t timestamp; timestamp = __rdtscp(&junk); // Do something with the timestamp
The following program is the same as the previous one: it initialises some array cells in the first for
loop and it accesses them in the second for
loop. Your task it to create a new file task1.c
(use nano task1.c
to create the file and edit it), paste the C code below and complete it using __rdtscp
to measure and report the number of cycles taken by each access to *addr
(see the comments):
#include <emmintrin.h> #include <x86intrin.h> #include <stdint.h> #include <stdio.h> #define SIZE 10 #define BLOCK_SIZE 4096 uint8_t array[SIZE * BLOCK_SIZE]; int main() { unsigned int junk = 0; register uint64_t start, delta, sum = 0; volatile uint8_t *addr; int i; // Bring the array to memory for (i = 0; i < SIZE; i++) array[i * BLOCK_SIZE] = 1; for (i = 0; i < SIZE; i++) { addr = &array[i * BLOCK_SIZE]; start = /* Complete me: measure the timestamp here (use junk as a throw-away) */; junk = *addr; delta = /* Complete me: measure the timestamp and compute the difference w.r.t. start here (use junk as a throw-away) */; sum += delta; printf("Access time for array[%d*4096]: %ld CPU cycles\n", i, delta); } printf("Avg. access time: %ld CPU cycles\n", sum/SIZE); return 0; }
As explained in the prerequisites you can compile the program with gcc task1.c -o task1 -static
and then execute it in the gem5 simulator with simulate task1
. The advantage of using gem5 is that it makes execution deterministic since no other programs interfere with task1
and the actual architectural details of your host machine do not matter.
- Is the time taken for each access relatively constant?
- Observe the average number of cycles taken by memory accesses.
delta
with no whitespace and no semicolon.