Polyalphabetic ciphers

We have seen that monoalphabetic ciphers are prone to statistical attacks, since they preserve the statistical structure of the plaintext. To overcome this issue, it is important that the same plain symbol is not always mapped to the same encrypted symbol. When this happens the cipher is called polyalphabetic.

Vigenére cipher

This simple polyalphabetic cipher works on “blocks” of m letters with a key of length m. In fact, a key is also a block of m letter. Formally, {\cal P} = {\cal C} = {\cal K} = \mathbb{Z}_{26}^m, where \mathbb{Z}_{26}^m is \mathbb{Z}_{26}\times\mathbb{Z}_{26}\times\ldots\times\mathbb{Z}_{26}, m times. Encryption and decryption are defined as follows:

\sf \begin{array}{rcl} E_{k_1, \ldots, k_m}(x_1, \ldots, x_m) & = & (x_1 + k_1, \ldots, x_m + k_m) \mbox{ mod } 26 \\ D_{k_1, \ldots, k_m}(y_1, \ldots, y_m) & = & (y_1 - k_1, \ldots, y_m - k_m) \mbox{ mod } 26 \end{array}

For example, the sentence “THISISAVERYSECRETMESSAGE” encrypted under key  “FLUTE” is computed as follows:


The plaintext is split into block of length 5 and the key FLUTE is repeated as necessary and used to encrypt each block.

The number of possible keys is 26^m, i.e., all the possible sequence of letters of length m. For m big enough this prevents brute force attacks. Note that the number of keys grows exponentially with respect to the length. You can also notice, from the example, that one letter is not always mapped to the same one (unless the are at a distance that is multiple of m). For example the fist two letters “I” are encrypted as “C” and “M”, respectively. While the “S” in position 6 and the last one are both encrypted as “X” using the “F” of “FLUTE”. They are, in fact, at distance 15 which is a multiple of 5.

Security of the cipher

Even if Vigenére cipher hides the statistic structure of the plaintext better than monoalphabetic ciphers, it still preserve most of it. There are two famous methods to break this cipher. The first is due to Friedrick Kasiski (1863) and the second to Wolfe Friedman (1920). We illustrate the latter since it is more suitable to be mechanized. Both are based on the following steps:

  1. Recover the length m of the key
  2. Recover the key

The Friedman method uses statistical measures. In particular for step 1 we consider the index of coincidence:

\displaystyle I_c(x)=\frac{\sum_{i=1}^{26} f_i(f_i -1)}{n(n-1)}\approx\sum_{i=1}^{26}p_i^2

where f_i is the frequency of i-th letter in a text of length n, i.e., the number of times it occurs in such a text, and p_i = \frac{f_i}{n} is the probability of letter i-th. Intuitively, this measure gives the probability that two letters, chosen at random from a text, are the same.
Notice that the value of the index is minimum, with value 1/26 ≈ 0.038, for texts composed of letters chosen with uniform probability 1/26 while it is maximum, with value 1, for texts composed of just a single letter repeated n times. It is, in fact, a measure of how non-uniformely letters are distributed in a text. So each natural language has a characteristic index of coincidence. For English the value is approximatively 0.065

1) Recover the length m of the key. The idea is to find the length by brute-forcing, following this algorithm:

m = 1
LIMIT=0.06 # this is to check that ICs are above 0.06 and thus close to 0.065
found = False
while(not found):
    sub = subciphers() # takes the m subciphertexts sub[m] obtained by selecting one letter every m
    found = True
    for i in range(0,m): # compute the Ic of all subtexts
        if Ic(sub[i]) < LIMIT:
             # if one of the Ic is not as expected try to increase length
             found = False
             m += 1
# survived the check, all Ic's are above LIMIT

For example, for length 2 we get the two subciphertexts composed of only the letters in odd positions and even positions, respectively. We require that the index of coincidence of all the subtexts is close to the characteristic index of the plaintext language. Typically, the bigger is the index the higher is the probability that the frequencies of the letters are close to the one of the plaintext language. It is thus enough to choose a lower bound such as 0.06 and check that the ICs are above that.

2) Recover the key. We find the relative shift between two subciphers by using the mutual index of coincidence 

\displaystyle MI_c(x,x')=\frac{\sum_{i=1}^{26} f_i f'_i }{n n'}=\sum_{i=1}^{26}p_i p'_i

representing the probability that two letters taken from two texts x and x' are the same. The idea is to shift one subcipher until the mutual index of coincidence with the first subcipher becomes close to the one of the plaintext language. When this happens, we know that the applied shift is the relative shift between the two subciphers and, consequently, between the corresponding letters of the key. This is encoded in the following algorithm. In fact, what we do, is to select the relative shift that maximizes the mutual index of coincidence.

key = [] # empty list
for i in range(0,m): # for any letter of the key
    k = 0      # current relative shift
    mick = 0   # maximum index so far (we start with 0)
    for j in range(0,26): # for any possible relative shift
        # compute the mutual index of coincidence between the first subcipher 
        # sub[0] and the i-th subcipher shifted by j
        mic = MIc(sub[0], shift(j,sub[i]))
        if mic > mick: # if it is the biggest so far
            k = j      # we remember the relative shift
            mick = mic # ... and the new maximum
    key.append(k)      # we append to the list the shift we have found 

In the list key we obtain the list of relative shifts. For example key = [0,4,6,3,9] means that the second letter of the key is equal to the first plus 4 while the third is the first plus 6 and so on. The final step is to try all the possible 26 first letter of the key, giving 26 possible keys. (This step could be avoided computing the MIc with a reference text written in the plaintext language.)

Challenge. Can you write a program that breaks the following ciphertext?