Regular expressions (regexp) are a fundamental tool to quickly parse input and are very useful when interacting with other programs. We have already studied regular expressions in a previous class on “Stream editor and regular expressions” (slides).

## Simple searches

In Python, regular expressions are implemented in the `re` module. The function `re.search(r, s)` looks for a match of the regular expression `r` in a string `s`. For example:

    >>> import re
    >>> re.search('([a-z]+)([0-9]+)','hello12345world')
    <_sre.SRE_Match object; span=(0, 10), match='hello12345'>

Explanation: `re.search` looks for a match of the regexp `([a-z]+)([0-9]+)` in the string `hello12345world`. The regular expression matches any sequence of at least one lowercase letter followed by at least one digit. The search is successful and a Match object is returned. The position (span) of the matched string is `(0,10)`, meaning that the matched characters are the ones from index `0` to index `9` (recall that, in Python, the ending boundary `10` is excluded from the range), and the actual matched string is `hello12345`.
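
When `re.search` is used in a script rather than at the prompt, the result usually needs a check, because the call returns `None` when nothing matches. The following is a minimal sketch of that pattern; the helper name `find_word_number` is hypothetical and introduced only for illustration.

```python
import re

def find_word_number(s):
    """Return (start, end, text) of the first letters-then-digits match, or None."""
    m = re.search('([a-z]+)([0-9]+)', s)
    if m is None:          # re.search returns None when there is no match
        return None
    start, end = m.span()  # end is excluded, as in Python slices
    return start, end, s[start:end]

print(find_word_number('hello12345world'))  # (0, 10, 'hello12345')
print(find_word_number('no digits here'))   # None
```

Slicing the original string with the span gives back exactly the matched text, which is why the ending boundary is excluded.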

## Groups

In order to extract the substring that matches a specific part of a regexp we use **groups**. Groups are enclosed in parentheses. For example, `([a-z]+)([0-9]+)` has two groups, one for the lowercase letters `[a-z]+` and one for the digits `[0-9]+`. The Match method `groups()` returns a tuple of the strings matching the groups:

    >>> re.search('([a-z]+)([0-9]+)','hello12345world').groups()
    ('hello', '12345')

Explanation: The two substrings matching the two groups are `hello` and `12345`.
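
Since `groups()` returns an ordinary tuple, it can be unpacked directly into variables. A small sketch, where the variable names `letters`, `digits` and `number` are chosen only for illustration:

```python
import re

m = re.search('([a-z]+)([0-9]+)', 'hello12345world')
letters, digits = m.groups()   # one string per group, in order
number = int(digits)           # the digit group is a string until converted
print(letters, number + 1)     # hello 12346
```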

With the method `group(n)` we can retrieve a single matching string. When `n` is `0` we get the full matching string, otherwise we get the `n`-th group. The method `span(n)` returns the position of the `n`-th group in the searched string:

    >>> re.search('([a-z]+)([0-9]+)','hello12345world').group(0)
    'hello12345'
    >>> re.search('([a-z]+)([0-9]+)','hello12345world').group(1)
    'hello'
    >>> re.search('([a-z]+)([0-9]+)','hello12345world').span(1)
    (0, 5)
    >>> re.search('([a-z]+)([0-9]+)','hello12345world').group(2)
    '12345'
    >>> re.search('([a-z]+)([0-9]+)','hello12345world').span(2)
    (5, 10)

Explanation:

- `group(0)` returns the full matching string `hello12345`
- `group(1)` returns the first group `hello`
- `span(1)` returns the position of `hello` in the searched string `hello12345world`, i.e. from index 0 to index 4 `(0,5)`
- `group(2)` returns the second group `12345`
- `span(2)` returns the position of `12345` in the searched string `hello12345world`, i.e. from index 5 to index 9 `(5,10)`
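
The positions returned by `span(n)` are indices into the string that was searched, not into the matched substring. This is easier to see when the match does not start at index 0. The sketch below uses an illustrative input with a three-character prefix and saves the Match object in a variable instead of repeating the search:

```python
import re

s = 'xx hello12345world'   # illustrative input: the match starts at index 3
m = re.search('([a-z]+)([0-9]+)', s)

for n in range(3):
    start, end = m.span(n)
    # Slicing the searched string with span(n) gives back group(n)
    assert s[start:end] == m.group(n)
    print(n, m.span(n), m.group(n))
# 0 (3, 13) hello12345
# 1 (3, 8) hello
# 2 (8, 13) 12345
```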

## Exercise

Use a Python regexp to find the only word of 5 lowercase letters that is preceded by “the” and starts with “d” in `/home/rookie/Python/moby.txt`. The word is the password for the next task!

Hint: read the file content using the following code, then search in the variable `data`:

    with open('/home/rookie/Python/moby.txt','r') as f:
        data = f.read()