Task 1: regexp in Python

Regular expressions (regexp) are a fundamental tool to quickly parse input and are very useful when interacting with other programs. We have already studied regular expressions in a previous class on “Stream editor and regular expressions” (slides).

Simple searches

In Python, regular expressions are implemented in the re module. The function re.search(r,s) looks for the first match of the regular expression r in the string s. For example:

>>> import re
>>> re.search('([a-z]+)([0-9]+)','hello12345world')        
<_sre.SRE_Match object; span=(0, 10), match='hello12345'>

Explanation: re.search looks for a match of the regexp ([a-z]+)([0-9]+) in the string hello12345world. The regular expression matches any sequence of at least one lowercase letter followed by at least one digit. The search succeeds and a Match object is returned (Python 3.7 and later print it as <re.Match object; ...>). The position (span) of the matched string is (0,10), meaning that the matched chars are the ones from index 0 to index 9 (recall that, in Python, the ending boundary 10 is excluded from the range), and the actual matched string is hello12345.
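When the regexp does not occur in the string, re.search returns None instead of a Match object, so it is safer to test the result before using it. A minimal sketch (the function name describe is only illustrative):

```python
import re

def describe(s):
    # re.search returns None when the pattern is not found in s
    m = re.search(r'([a-z]+)([0-9]+)', s)
    if m is None:
        return 'no match'
    return 'match at ' + str(m.span())

print(describe('hello12345world'))  # match at (0, 10)
print(describe('HELLO 123'))        # no match
```

The pattern is written as a raw string (r'...') so that backslashes, if any are added later, reach the re module unchanged.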

Groups

In order to extract the substring that matches a specific part of a regexp we use groups. Groups are enclosed in parentheses. For example, ([a-z]+)([0-9]+) has two groups, one for the lowercase chars [a-z]+ and one for the digits [0-9]+. The method groups() returns a tuple of the strings matching the groups:

>>> re.search('([a-z]+)([0-9]+)','hello12345world').groups()
('hello', '12345')

Explanation: The two substrings matching the two groups are: hello and 12345.
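Since groups() returns an ordinary tuple, it can be unpacked into variables. A short sketch (the variable names word, digits and number are only illustrative):

```python
import re

m = re.search(r'([a-z]+)([0-9]+)', 'hello12345world')
# One variable per group, in the order the groups appear in the regexp
word, digits = m.groups()
# Groups are always strings: convert explicitly when a number is needed
number = int(digits)
print(word, number + 1)  # hello 12346
```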

With the method group(n) we can retrieve a single matching string. When n is 0 we get the full matching string, otherwise we get the n-th group. The method span(n) returns the position of the n-th group in the searched string:

>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(0)
'hello12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(1)
'hello'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(1) 
(0, 5)
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(2)
'12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(2) 
(5, 10)

Explanation:

  • group(0) returns the full matching string hello12345
  • group(1) returns the first group hello
  • span(1) returns the position of hello in the searched string hello12345world, i.e. from index 0 to index 4 (0,5)
  • group(2) returns the second group 12345
  • span(2) returns the position of 12345 in the searched string hello12345world, i.e. from index 5 to index 9 (5,10)
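In the examples above the match starts at index 0, so it is easy to miss that spans are always relative to the searched string and not to the matched text. The following sketch stores the Match object in a variable and searches a string with a two-character prefix (the string is made up for illustration):

```python
import re

text = '--hello12345world'
m = re.search(r'([a-z]+)([0-9]+)', text)

print(m.span(0))  # (2, 12)
print(m.span(1))  # (2, 7)
print(m.span(2))  # (7, 12)

# A span can be used directly to slice the searched string
start, end = m.span(2)
print(text[start:end])  # 12345
```

Keeping the Match object in m also avoids repeating the same search for every call to group or span.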

Exercise

Use a Python regexp to find the only word of 5 lowercase letters preceded by “the” and starting with “d” in /home/rookie/Python/moby.txt. The word is the password for the next task!

Hint: read the file content using the following code, then search in the variable data:

with open('/home/rookie/Python/moby.txt','r') as f: 
    data = f.read()
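f.read() returns the whole file as a single string, newlines included, so re.search scans across all lines at once. The sketch below illustrates this with the regexp from the examples above; it writes a small temporary file (a made-up stand-in, not moby.txt) so that it runs on any machine:

```python
import os
import re
import tempfile

# Hypothetical stand-in for the real file, created only for this sketch
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line with abc123 inside\n')
    path = tmp.name

with open(path, 'r') as f:
    data = f.read()
os.remove(path)  # clean up the temporary file

# The match is found on the second line even though data is one string
m = re.search(r'([a-z]+)([0-9]+)', data)
if m is not None:
    print(m.group(0))  # abc123
```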