Regular expressions (regexp) are a fundamental tool to quickly parse input and are very useful when interacting with other programs. We have already studied regular expressions in a previous class on “Stream editor and regular expressions” (slides).
Simple searches
In Python, regular expressions are implemented in the re module. The function re.search(r,s) looks for a match of the regular expression r in the string s. For example:
>>> import re
>>> re.search('([a-z]+)([0-9]+)','hello12345world')
<_sre.SRE_Match object; span=(0, 10), match='hello12345'>
Explanation: re.search looks for a match of the regexp ([a-z]+)([0-9]+) in the string hello12345world. The regular expression represents any sequence of at least one lowercase letter followed by at least one digit. The search is successful and a Match object is returned. The position (span) of the matched string is (0,10), meaning that the matched chars are the ones from index 0 to index 9 (recall that, in Python, the ending boundary 10 is excluded from the range), and the actual matched string is hello12345.
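When nothing matches, re.search returns None instead of a Match object, so it is safer to check the result before using it. The following is a minimal sketch of that check; the helper name describe and the second sample string are made up for illustration.

```python
import re

# Same pattern as in the example above: lowercase letters followed by digits.
pattern = r'([a-z]+)([0-9]+)'

def describe(s):
    """Return the matched text and its span, or None when nothing matches."""
    m = re.search(pattern, s)
    if m is None:              # re.search returns None when there is no match
        return None
    start, end = m.span()      # the end index is excluded, as in slicing
    return s[start:end], (start, end)

found = describe('hello12345world')
missing = describe('HELLO WORLD')   # no lowercase letters, so no match
print(found)
print(missing)
```

Slicing the string with the span gives back exactly the matched text, which is why the ending boundary is excluded.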
Groups
In order to extract the substring that matches a specific part of a regexp we use groups. Groups are enclosed in parentheses. For example, ([a-z]+)([0-9]+) has two groups, one for the lowercase letters [a-z]+ and one for the digits [0-9]+. The method groups() of the Match object returns a tuple of the strings matching the groups:
>>> re.search('([a-z]+)([0-9]+)','hello12345world').groups()
('hello', '12345')
Explanation: The two substrings matching the two groups are hello and 12345.
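Since groups() returns a plain tuple, its items can be unpacked into variables. A small sketch, where the variable names word, number and value are chosen only for illustration:

```python
import re

m = re.search(r'([a-z]+)([0-9]+)', 'hello12345world')
word, number = m.groups()   # one string per group, in left-to-right order
value = int(number)         # groups are always strings; convert when needed
print(word, value)
```

Note that the digits are returned as the string '12345' and become a number only after the explicit conversion.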
With the method group(n) we can retrieve a single matching string. When n is 0 we get the full matching string, otherwise we get the n-th group. The method span(n) returns the position of the n-th group in the searched string:
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(0)
'hello12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(1)
'hello'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(1)
(0, 5)
>>> re.search('([a-z]+)([0-9]+)','hello12345world').group(2)
'12345'
>>> re.search('([a-z]+)([0-9]+)','hello12345world').span(2)
(5, 10)
Explanation:
group(0) returns the full matching string hello12345
group(1) returns the first group hello
span(1) returns the position of hello in hello12345world, i.e. from index 0 to index 4: (0,5)
group(2) returns the second group 12345
span(2) returns the position of 12345 in hello12345world, i.e. from index 5 to index 9: (5,10)
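In the example the match starts at index 0, so it is not obvious what the spans are measured from. The sketch below uses a made-up string with a three-character prefix to show that span(n) counts from the start of the searched string, not from the start of the match.

```python
import re

s = '>> hello12345world'    # sample string, invented for illustration
m = re.search(r'([a-z]+)([0-9]+)', s)

full = m.group(0)           # the whole match
full_span = m.span(0)       # where the whole match sits in s
first_span = m.span(1)      # where the first group sits in s
second_span = m.span(2)     # where the second group sits in s

# Slicing s with a span gives back the corresponding group.
start, end = second_span
digits = s[start:end]
print(full, full_span, first_span, second_span, digits)
```

Here the whole match spans (3, 13), the first group (3, 8) and the second group (8, 13): every index is shifted by the length of the prefix.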
Exercise
Use a Python regexp to find the only word of 5 lowercase letters preceded by “the” and starting with “d” in /home/rookie/Python/moby.txt. The word is the password for the next task!
Hint: read the file content using the following code, then you can search in the variable data:
with open('/home/rookie/Python/moby.txt','r') as f:
    data = f.read()