Regular expressions (regexps) are a fundamental tool for quickly parsing input and are very useful when interacting with other programs. We have already studied regular expressions in a previous class on “Stream editor and regular expressions” (slides).
Simple searches
In Python, regular expressions are implemented in the re module. The function re.search(r,s) looks for a match of the regular expression r in a string s. For example:
>>> import re
>>> re.search('([a-z]+)([0-9]+)','hello12345world')
<_sre.SRE_Match object; span=(0, 10), match='hello12345'>
Explanation: re.search looks for a match of the regexp ([a-z]+)([0-9]+) in the string hello12345world. The regular expression matches any sequence of at least one lowercase letter followed by at least one digit. The search is successful and a Match object is returned. The position (span) of the matched string is (0,10), meaning that the matched characters are those from index 0 to index 9 (recall that, in Python, the ending boundary 10 is excluded from the range), and the actual matched string is hello12345.
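The example above always finds a match. As a minimal sketch of the other case, the snippet below shows that re.search returns None when the regexp does not match, so the result should be checked before calling methods on it. The variable names (found, missing) are only illustrative.

```python
import re

# A successful search returns a Match object
found = re.search('([a-z]+)([0-9]+)', 'hello12345world')

# An unsuccessful search returns None: 'helloworld' contains no digits
missing = re.search('([a-z]+)([0-9]+)', 'helloworld')

for m in (found, missing):
    if m is None:
        print('no match')
    else:
        print('matched', m.group(0))
```

Calling a method such as group() directly on a failed search raises an AttributeError, which is why the check against None comes first.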
Groups
In order to extract the substring that matches a specific part of a regexp we use groups. Groups are enclosed in parentheses. For example, ([a-z]+)([0-9]+) has two groups, one for the lowercase letters [a-z]+ and one for the digits [0-9]+. The function groups() returns a tuple of the strings matching the groups:
>>> re.search('([a-z]+)([0-9]+)','hello12345world').groups()
('hello', '12345')
Explanation: The two substrings matching the two groups are hello and 12345.
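Since groups() returns an ordinary tuple, it can be unpacked into variables. The following sketch (the names word, digits and number are illustrative) also shows that the matched groups are always strings, so the digits need an explicit conversion before doing arithmetic:

```python
import re

m = re.search('([a-z]+)([0-9]+)', 'hello12345world')

# Unpack the tuple returned by groups() into two variables
word, digits = m.groups()

# Groups are strings: convert explicitly to work with the number
number = int(digits)
print(word, number + 1)
```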
With the function group(n) we can retrieve a single matching string. When n is 0 we get the full matching string, otherwise we get the n-th group. The function span(n) returns the position of that string in the full string:
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(0)
'hello12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(1)
'hello'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(1)
(0, 5)
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(2)
'12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(2)
(5, 10)
Explanation:
group(0) returns the full matching string hello12345
group(1) returns the first group hello
span(1) returns the position of hello in hello12345world, i.e. from index 0 to index 4 (0,5)
group(2) returns the second group 12345
span(2) returns the position of 12345 in hello12345world, i.e. from index 5 to index 9 (5,10)
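The calls above repeat the same search each time. A possible alternative, sketched below, is to store the Match object in a variable once and query it several times. The loop also checks that slicing the original string with span(n) gives back exactly group(n); the names s and m are illustrative.

```python
import re

s = 'hello12345world'
m = re.search('([a-z]+)([0-9]+)', s)

# n = 0 is the full match, n = 1 and n = 2 are the two groups
for n in range(3):
    start, end = m.span(n)
    # The end index is excluded, so s[start:end] is the matched substring
    print(n, m.group(n), (start, end), s[start:end] == m.group(n))
```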
Exercise
Use a Python regexp to find the only word of 5 lowercase letters preceded by “the” and starting with “d” in /home/rookie/Python/moby.txt. The word is the password for the next task!
Hint: read the file content using the following code, then search in the variable data:
with open('/home/rookie/Python/moby.txt','r') as f:
    data = f.read()