Python: Regular Expressions
Bruce Beckles
University Computing Service
Bob Dowling
Scientific Computing Support e-mail address:
escience-support@ucs.cam.ac.uk
Welcome to the University Computing Service's "Python: Regular Expressions"
course.
The official UCS e-mail address for all scientific computing support queries, including
any questions about this course, is:
escience-support@ucs.cam.ac.uk
This course:
    basic regular expressions
    getting Python to use them
Another UCS course:
    Pattern Matching Using Regular Expressions
    more powerful regular expressions
    efficient regular expressions
Before we start, let's specify just what is and isn't in this course.
This course is a very simple, beginner's course on regular expressions. It mostly
covers how to get Python to use them. If you want to learn the full power of regular
expressions go to the UCS two-afternoon course called "Pattern Matching Using
Regular Expressions" which will cover them in detail, but in a way that doesn't focus
on any particular language. For further details of this course see the course
description at:
http://training.csx.cam.ac.uk/course/regex
There is an on-line introduction called the Python "Regular Expression HowTo" at:
http://www.amk.ca/python/howto/regex/
and the formal Python documentation at
http://docs.python.org/library/re.html
There is a good book on regular expressions in the O'Reilly series called "Mastering
Regular Expressions" by Jeffrey E. F. Friedl. Be sure to get the third edition (or later)
as its author has added a lot of useful information since the second edition. There are
details of this book at:
http://regex.info/
http://www.oreilly.com/catalog/regex3/
There is also a Wikipedia page on regular expressions which has useful information
itself buried within it and a further set of references at the end:
http://en.wikipedia.org/wiki/Regular_Expression
A regular expression is a
"pattern" describing some text:
    "a series of digits"                                  \d+
    "a lower case letter followed by some digits"         [a-z]\d+
    "a mixture of characters except for new line,
     followed by a full stop and one or more
     letters or numbers"                                  .+\.\w+
A regular expression is simply some means to write down a pattern describing some
text. (There is a formal mathematical definition but we're not bothering with that here.
What the computing world calls regular expressions and what the strict mathematical
grammarians call regular expressions are slightly different things.)
For example we might like to say "a series of digits" or "a single lower case letter
followed by some digits". There are terms in regular expression language for all of
these concepts.
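As a quick preview, the three patterns from the slide can be tried directly in Python. This is only a sketch: the sample texts are made up for illustration, and the re.search() call it relies on is explained properly later in the course.

```python
import re

# Each pattern from the slide, paired with a made-up sample text.
examples = [
    (r"\d+", "room 101"),               # a series of digits
    (r"[a-z]\d+", "see section b42"),   # a lower case letter followed by some digits
    (r".+\.\w+", "hydrogen.dat"),       # anything, a full stop, then letters or numbers
]

matched = []
for pattern, text in examples:
    result = re.search(pattern, text)
    matched.append(result.group(0))     # group(0) is the text the pattern matched
    print(pattern, "matched", repr(result.group(0)))
```

Each pattern picks out only the part of the sample text that fits its description.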
A regular expression is a
"pattern" describing some text:
    \d+
    [a-z]\d+
    .+\.\w+
Isn't this just gibberish?
The language of
regular expressions
We will cover what this means in a few slides' time. We will start with a "trivial" regular
expression, however, which simply matches a fixed bit of text.
Classic regular expression filter
for each line in a file:
    does the line match a pattern?      (how can we tell?)
    if it does, output something        (what?)
        "Hey! Something matched!"
        The line that matched
        The bit of the line that matched
This is a course on using regular expressions from Python, so before we introduce
even our most trivial expression we should look at how Python drives the regular
expression system.
Our basic script for this course will run through a file, a line at a time, and compare the
line against some regular expression. If the line matches the regular expression the
script will output something. That "something" might be just a notice that it happened
(or a line number, or a count of lines matched, etc.) or it might be the line itself.
Finally, it might be just the bit of the line that matched.
Programs like this, that produce a line of output if a line of input matches some
condition and no line of output if it doesn't, are called "filters".
Task: Look for "Fred" in a list of names
names.txt:
    Alice
    Bob
    Charlotte
    Derek
    Ermintrude
    Fred
    Freda
    Frederick
    Felicity
    …
freds.txt:
    Fred
    Freda
    Frederick
So we will start with a script that looks for the fixed text "Fred" in the file names.txt.
For each line that matches, the line is printed. For each line that doesn't, nothing is
printed.
c.f. grep
$ grep 'Fred' < names.txt
Fred
Freda
Frederick
$
This is equivalent to the traditional Unix command, grep.
Skeleton Python script
import sys                                      # to get stdin & stdout
import <regular expression module>
<define pattern>
<set up regular expression>
for line in sys.stdin:                          # read in the lines one at a time
    <compare line to regular expression>
    if <regular expression matches>:
        sys.stdout.write(line)                  # write out the matching lines
So we will start with the outline of a Python script and review the non-regular
expression lines first.
Because we are using standard input and standard output, we will import the sys
module to give us sys.stdin and sys.stdout.
We will process the file a line at a time. The Python object sys.stdin corresponds
to the standard input of the program and if we use it like a list, as we do here, then it
behaves like a list of lines. So the Python "for line in sys.stdin:" sets up a
for loop running through a line at a time, setting the variable line to be one line of the
file after another as the loop repeats. The loop ends when there are no more lines in
the file to read.
The if statement simply looks at the results of the comparison to see if it was a
successful comparison for this particular value of line or not.
The sys.stdout.write() line in the script simply prints the line. We could just use
print but we will use sys.stdout for symmetry with sys.stdin.
The pseudo-script on the slide contains all the non-regular-expression code required.
What we have to do now is to fill in the rest.
Skeleton Python script
import sys
import <regular expression module>              # "re"
<define pattern>                                # "gibberish"
<set up regular expression>                     # prepare the reg. exp.
for line in sys.stdin:
    <compare line to regular expression>        # use the reg. exp.
    if <regular expression matches>:            # see what we got
        sys.stdout.write(line)
Now let's look at the regular expression lines we need to complete.
The regular expression module in Python is called "re", so we will need the line
"import re" to load what we need.
In Python we need to set up the regular expression in advance of using it. (Actually
that's not always true but this pattern is more flexible and more efficient so we'll focus
on it in this course.)
Finally, for each line we read in we need some way to determine whether our regular
expression matches that line or not.
Skeleton Python script
import sys
import re                                       # Ready to use regular expressions
<define pattern>
<set up regular expression>
for line in sys.stdin:
    <compare line to regular expression>
    if <regular expression matches>:
        sys.stdout.write(line)
So we start by importing the regular expressions module, re.
Defining the pattern
pattern = "Fred"                                # Simple string
In this very simple case of looking for an exact string, the pattern is simply that string.
So, given that we are looking for "Fred", we set the pattern to be "Fred".
Skeleton Python script
import sys
import re
pattern = "Fred"                                # Define the pattern
<set up regular expression>
for line in sys.stdin:
    <compare line to regular expression>
    if <regular expression matches>:
        sys.stdout.write(line)
In our skeleton script, though, we simply define a string.
This is just a Python string. We still need to turn it into something that can do the
searching for "Fred".
Setting up a regular expression
regexp = re.compile(pattern)
    regexp      regular expression object
    re          from the re module
    compile     compile the pattern
    pattern     "Fred"
Next we need to look at how to use a function from this module to set up a regular
expression object in Python from that simple string.
The re module has a function "compile()" which takes this string and creates an
object Python can do something with. This is deliberately the same word as we use
for the processing of source code (text) into machine code (program). Here we are
taking a pattern (text) and turning it into the mini-program that does the testing.
The result of this compilation is a "regular expression object", the mini program that
will do work relevant to the particular pattern "Fred". We store this in a variable,
"regexp", so we can use it later in the script.
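The compilation step on its own can be sketched as below. The only addition to what the slide shows is a look at the object's pattern attribute, which is included here simply to show that the compiled object remembers the string it was built from.

```python
import re

pattern = "Fred"                # an ordinary Python string
regexp = re.compile(pattern)    # the compiled regular expression object

# The compiled object keeps a copy of the text it was compiled from.
print(regexp.pattern)
```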
Skeleton Python script
import sys
import re
pattern = "Fred"
regexp = re.compile(pattern)                    # Prepare the regular expression
for line in sys.stdin:
    <compare line to regular expression>
    if <regular expression matches>:
        sys.stdout.write(line)
So we put that compilation line in our script instead of our placeholder.
Next we have to apply that regular expression object, regexp, to each line as it is
read in to see if the line matches.
Using a regular expression
result = regexp.search(line)
    result      The result of the test.
    regexp      The reg. exp. object we just prepared.
    search      The reg. exp. object's search() method.
    line        The text being tested.
We start by doing the test and then we will look at the test's results.
The regular expression object that we have just created, "regexp", has a method (a
built in function specific to itself) called "search()". So to reference it in our script we
need to refer to "regexp.search()". This method takes the text being tested (our
input line in this case) as its only argument. The input line is in the variable line so we
need to run "regexp.search(line)" to get our result.
Note that the string "Fred" appears nowhere in this line. It is built in to the regexp
object.
Incidentally, there is a related, confusingly similar method called "match()". Don't use
it. (And that's the only time it will be mentioned in this course.)
Skeleton Python script
import sys
import re
pattern = "Fred"
regexp = re.compile(pattern)
for line in sys.stdin:
    result = regexp.search(line)                # Use the reg. exp.
    if <regular expression matches>:
        sys.stdout.write(line)
So we put that search line in our script instead of our placeholder.
Next we have to test the result to see if the search was successful.
Testing a regular expression's results
if result:
    result                  The result of the regular expression's search() method.
    Search successful:      tests as True
    Search unsuccessful:    tests as False
The search() method returns the Python "null object", None, if there is no match and
something else (which we will return to later) if there is one. So the result variable
now refers to whatever it was that search() returned.
None is Python's way of representing "nothing". The if test in Python treats None as
False and the "something else" as True so we can use result to provide us with a
simple test.
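The behaviour of the test can be checked with a short sketch like the following. The two sample lines are taken from names.txt; the variable names hit and miss are just for illustration.

```python
import re

regexp = re.compile("Fred")

hit = regexp.search("Frederick\n")   # the pattern occurs in this line
miss = regexp.search("Alice\n")      # the pattern does not occur in this line

print(hit is None)    # False: search() returned "something else"
print(miss is None)   # True: search() returned None

for name in ["Frederick\n", "Alice\n"]:
    if regexp.search(name):          # None tests as False, a match tests as True
        print("matched:", name.strip())
```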
Skeleton Python script
import sys
import re
pattern = "Fred"
regexp = re.compile(pattern)
for line in sys.stdin:
    result = regexp.search(line)
    if result:                                  # See if the line matched
        sys.stdout.write(line)
So if we drop that line into our skeleton Python script we have completed it.
This Python script is a fairly generic filter. If an input line matches the pattern, write the
line out. If it doesn't, don't write anything.
We will only see two variants of this script in the entire course: in one we only print out
certain parts of the line and in the other we allow for there being multiple Freds in a
single line.
Exercise 1: complete the file
filter01.py:
    import sys
    import re
    pattern = "Fred"
    regexp = …
    for line in sys.stdin:
        result = …
        if result:
            sys.stdout.write(line)
If you look in the directory prepared for you, you will find a Python script called
"filter01.py" which contains just this script with a couple of critical lines missing.
Your first exercise is to edit that file to make it a search for the string 'Fred'.
Exercise 1: test your file
$ python filter01.py < names.txt
Fred
Freda
Frederick
Once you have edited the script give it a try. Note that three names match the test
pattern: Fred, Freda and Frederick. If you don't get this result go back to the script and
correct it.
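For checking your answer afterwards, here is one way the finished filter might look. It is a sketch rather than the course's own solution: the function name filter_lines and the use of io.StringIO to stand in for names.txt are assumptions made so that the example runs without any files.

```python
import io
import re
import sys


def filter_lines(source, destination, pattern="Fred"):
    """Write to destination every line of source that matches pattern."""
    regexp = re.compile(pattern)
    for line in source:
        if regexp.search(line):
            destination.write(line)


# A made-up, shortened stand-in for names.txt.
names = io.StringIO("Alice\nBob\nFred\nFreda\nFrederick\nFelicity\n")
output = io.StringIO()
filter_lines(names, output)
sys.stdout.write(output.getvalue())
```

Calling filter_lines(sys.stdin, sys.stdout) instead would give the same behaviour as the script on the slide.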
Case sensitive matching
names.txt:
    Fred
    Freda
    Frederick
    Manfred
\ 
[123 ]
So this is a slight change needed to our regular expression language to support
multi-line regular expressions.
Comments
^
[A-Z][a-z]{2}\ # Month
[123][0-9]\ 
\d\d:\d\d:\d\d\ 
noether\ sshd
\[\d+\]:\ 
Invalid\ user\ 
\S+\ 
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
$
In "\ # Month":
    "\ "        Significant space
    " "         Ignored space
    "# Month"   Comment
Now let's add comments.
We will introduce them using exactly the same character as is used in Python proper,
the "hash" character, "#".
Any text from the hash character to the end of the line is ignored.
This means that we will have to have some special treatment for hashes if we want to
match them as ordinary characters, of course. It's time for another backslash.
Hashes in verbose mode
Hash introduces a comment               # Month
Backslash hash matches "#"              \#
Backslash not needed in […]             [123#]
In multi-line mode, hashes introduce comments. The backslashed hash, "\#",
matches the hash character itself. Again, just as with space, you don't need the
backslash inside square brackets.
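The two ways of matching a literal hash can be compared in a small sketch. The pattern and the sample line are invented for illustration, and the re.VERBOSE option used to switch on this mode is part of Python's re module.

```python
import re

# In verbose mode an unescaped hash starts a comment, so a literal hash
# needs either a backslash or a place inside square brackets.
escaped = re.compile(r"""
    issue\ \#\d+      # escaped space, escaped hash, then digits
    """, re.VERBOSE)

bracketed = re.compile(r"""
    issue[ ][#]\d+    # the same thing using square brackets instead
    """, re.VERBOSE)

line = "Fixed in issue #42 yesterday."    # made-up sample text
first = escaped.search(line).group(0)
second = bracketed.search(line).group(0)
print(first)
print(second)
```

Both versions match the same text; which one reads better is largely a matter of taste.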
Verbose mode
^
[A-Z][a-z]{2}\ # Month
[123][0-9]\ # Day
\d\d:\d\d:\d\d\ # Time
noether\ sshd
\[\d+\]:\ # Process ID
Invalid\ user\ 
\S+\ # User ID
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} # IP address
$
So this gives us our more legible mode. Each element of the regular expression gets a
line to itself so it at least looks like smaller pieces of gibberish. Furthermore each can
have a comment so we can be reminded of what the fragment is trying to match.
Python long strings
r"""                                            Start raw long string
^
[A-Z][a-z]{2}\ # Month
[123][0-9]\ # Day
\d\d:\d\d:\d\d\ # Time
noether\ sshd
\[\d+\]:\ # Process ID
Invalid\ user\ 
\S+\ # User ID
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} # IP address
$
"""                                             End long string
This verbose regular expression pattern covers many lines. Python has a mechanism
specifically designed for multi-line strings: the triple-quoted string. We always use that,
in conjunction with the r ("raw") qualifier to carry these verbose regular expression
patterns.
Telling Python to "go verbose"
Verbose mode
Another option, like ignoring case
Module constant, like re.IGNORECASE
re.VERBOSE      re.X
So now all we have to do is to tell Python to use this verbose mode instead of its usual
one. We do this as an option on the re.compile() function just as we did when we
told it to work case insensitively. There is a Python module constant re.VERBOSE
which we use in exactly the same way as we did re.IGNORECASE. It has a short
name "re.X" too.
Incidentally, if you ever wanted case insensitivity and verbosity, you add the two
together:
regexp = re.compile(pattern, re.IGNORECASE+re.VERBOSE)
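A short sketch of combining the two options follows. It uses the bitwise-or operator "|", which is the more usual Python idiom for combining options; for these two different options it gives the same value as adding them. The pattern and the list of names are made up for illustration.

```python
import re

pattern = r"""
    fred     # the name, written in lower case
    \w*      # any further letters, digits or underscores
    """

# Combine the two options; re.I and re.X are the short names.
regexp = re.compile(pattern, re.IGNORECASE | re.VERBOSE)

names = ["Fred", "FREDA", "Manfred", "Alice"]
found = [name for name in names if regexp.search(name)]
print(found)
```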
Before:
    import sys
    import re
    …
    pattern = r"^[A-Z][a-z]{2}…$"
    …
    regexp = re.compile(pattern)
    …
After:
    import sys
    import re
    …
    pattern = r"""^                             (1)
    [A-Z][a-z]{2}\                              (2)
    …
    $
    """
    …
    regexp = re.compile(pattern, re.VERBOSE)    (3)
    …
So how would we change a filter script in practice to use verbose regular expressions?
It's actually a very easy three step process.
1. First you convert your pattern string into a multi-line string.
2. Then you make the backslash tweaks necessary.
3. Then you change the re.compile() call to have the re.VERBOSE option.
(And then you should test your script to see if it still works!)
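The three steps can be sketched on a small pattern, with a check that the concise and verbose versions agree. The pattern here is a cut-down, made-up one rather than the log-file pattern from the exercises, and the sample lines are invented.

```python
import re

# The starting point: a concise, single-line pattern.
concise = re.compile(r"^RUN \d{6} COMPLETED\.$")

# Step 1: a multi-line (triple-quoted, raw) string.
# Step 2: significant spaces get a backslash.
# Step 3: re.VERBOSE is passed to re.compile().
verbose = re.compile(r"""
    ^                # Start of line
    RUN\             # Fixed text followed by a significant space
    \d{6}            # Job number
    \ COMPLETED\.    # Fixed text ending in a full stop
    $                # End of line
    """, re.VERBOSE)

sample_lines = ["RUN 000001 COMPLETED.", "RUN 12 COMPLETED.", "run 000002 COMPLETED."]
concise_hits = [bool(concise.search(line)) for line in sample_lines]
verbose_hits = [bool(verbose.search(line)) for line in sample_lines]
print(concise_hits == verbose_hits)
```

Comparing the two versions against the same input is a cheap way of doing the "test your script" step.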
Exercise 7: use verbose mode
filter02.py, filter03.py:
    single line regular expression  →  verbose regular expression
So now that you've seen how to turn an "ordinary" regular expression into a "verbose"
one with comments, it's time to try it for real.
Edit the files filter02.py and filter03.py so that the regular expression
patterns they use are "verbose" ones laid out across multiple lines with suitable
comments. Test them against the same input files as before.
If you have any problems with this exercise, please ask the lecturer.
Extracting bits from the line
^                                       # Start of line
RUN\ 
\d{6}                                   # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
[a-z]+\.dat                             # File name
\.
$                                       # End of line
Suppose we wanted to extract just these two components:
    \d{6}
    [a-z]+\.dat
We're almost finished with the regular expression syntax now. We have most of what
we need for this course and can now get on with developing Python's system for using
it. We will continue to use the verbose version of the regular expressions as it is easier
to read, which is helpful for courses as well as for real life! Note that nothing we teach
in the remainder of this course is specific to verbose mode; it will all work equally well
in the concise mode too.
Suppose we are particularly interested in two parts of the line, the job number and the
file name. Note that the file name includes both the component that varies from line to
line, "[a-z]+", and the constant, fixed suffix, ".dat".
What we will do is label the two components in the pattern and then look at Python's
mechanism to get at their values.
Changing the pattern
^                                       # Start of line
RUN\ 
(\d{6})                                 # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
([a-z]+\.dat)                           # File name
\.
$                                       # End of line
Parentheses around the patterns: "Groups"
We start by changing the pattern to place parentheses (round brackets) around the
two components of interest. These two selected areas are called "groups".
Parentheses in regular expressions
Parentheses surround a group                    (…)
Use backslash to match "(" or ")"               \(  \)
Backslash not needed in […]                     [123(]
Parentheses without backslashes must "balance"
If you want to match a literal parenthesis use "\(" or "\)".
Note that because (unbackslashed) parentheses have this special meaning of defining
subsets of the matching line they must match. If they don't then the re.compile()
function will give an error similar to this:
>>> pattern='('
>>> regexp=re.compile(pattern)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib64/python2.6/re.py", line 243, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
>>>
The "match object"
…
regexp = re.compile(pattern, re.VERBOSE)
for line in sys.stdin:
    result = regexp.search(line)                # result is the "match object"
    if result:
        …
Now we are asking for certain parts of the pattern to be specially treated (as "groups")
we must turn our attention to the result of the search to get at those groups.
To date all we have done with the results is to test them for truth or falsehood: "is
there something there or not?" Now we will dig in more deeply.
Using the match object
Line:
    RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.
result.group(1)         '000001'
result.group(2)         'hydrogen.dat'
result.group(0)         whole pattern
We get at the groups from the match object. The method result.group(1) will
return the contents of the first pair of parentheses and the method
result.group(2) will return the content of the second.
Avid Pythonistas will recall that Python usually counts from zero and may wonder what
result.group(0) gives. This returns whatever the entire pattern matched. In our
case where our regular expression defines the whole line (^ to $) this is equivalent to
the whole line.
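The numbered groups can be tried out with a sketch like this one, which uses the pattern and sample line from the slides. Comments have been added after the lines ending in a significant space purely to make those spaces easier to see.

```python
import re

pattern = r"""
    ^                                   # Start of line
    RUN\                                # Fixed text and a significant space
    (\d{6})                             # Job number (group 1)
    \ COMPLETED\.\ OUTPUT\ IN\ FILE\    # Fixed text
    ([a-z]+\.dat)                       # File name (group 2)
    \.
    $                                   # End of line
    """
regexp = re.compile(pattern, re.VERBOSE)

line = "RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.\n"
result = regexp.search(line)

print(result.group(1))   # contents of the first pair of parentheses
print(result.group(2))   # contents of the second pair
print(result.group(0))   # everything the whole pattern matched
```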
Putting it all together
…
regexp = re.compile(pattern, re.VERBOSE)
for line in sys.stdin:
    result = regexp.search(line)
    if result:
        sys.stdout.write("%s\t%s\n" % (result.group(1), result.group(2)))
So now we can write out just those elements of the matching lines that we are
interested in.
Note that we still have to test the result variable to make sure that it is not None
(i.e. that the regular expression matched the line at all). This is what the "if result:" test does
because None tests false. We cannot ask for the group() method on None because
it doesn't have one. If you make this mistake you will get an error message:
AttributeError: 'NoneType' object has no attribute 'group'
and your script will terminate abruptly.
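The need for the test can be demonstrated with a small sketch. The concise pattern and the two sample lines are made up for illustration, and the try/except is there only so that the error can be shown without stopping the example.

```python
import re

regexp = re.compile(r"RUN (\d{6})")
lines = ["RUN 000001 COMPLETED.", "no job number on this line"]

# The safe way: only ask for groups when there was a match.
jobs = []
for line in lines:
    result = regexp.search(line)
    if result:
        jobs.append(result.group(1))
print(jobs)

# The mistake: calling group() on None, the result of a failed search.
try:
    regexp.search(lines[1]).group(1)
except AttributeError as problem:
    message = str(problem)
    print(message)
```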
Exercise 8: limited output
Modify the log file filter to output just
the account name and the IP address.
filter03.py  →  filter04.py
sys.stdout.write("%s\t%s\n" % (name, address))
Now try it for yourselves. You have a file filter03.py which you created to answer
an earlier exercise. This finds the lines from the messages file which indicate an
Invalid user. Copy this script to filter04.py and edit it so that you define groups for
the account name (matched by \S+) and the IP address (matched by
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).
The bottom of the slide is a quick reminder of the string substitution syntax in Python.
This will get you nicely tab-aligned text.
Limitations of numbered groups
The problem:
    Insert a group: all following numbers change
    "What was group number three again?"
The solution: use names instead of numbers
    Insert a group: it gets its own name
    Use sensible names.
Groups in regular expressions are good but they're not perfect. They suffer from the
sort of problem that creeps up on you only after you've been doing Python regular
expressions for a bit.
Suppose you decide you need to capture another group within a regular expression. If
it is inserted between the first and second existing group, say, then the old group
number 2 becomes the new number 3, the old 3 the new 4 and so on.
There's also a problem that "result.group(2)" doesn't shout out what the second
group actually was.
There's a solution to this. We will associate names with groups rather than just
numbers.
Named groups
^                                       # Start of line
RUN\ 
(?P<jobnum>\d{6})                       # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
(?P<filename>[a-z]+\.dat)               # File name
\.
$                                       # End of line
?P<jobnum>      Specifying the name
A group named "jobnum"
So how do we do this naming?
We insert some additional controls immediately after the open parenthesis. In general
in Python's regular expression syntax "(?" introduces something special that may not
even be a group (though in this case it is). We specify the name with the rather bizarre
syntax "?P<groupname>".
Naming a group
(?P<filename>[a-z]+\.dat)
    (…)             group
    ?P              define a named group
    <…>             group name
    [a-z]+\.dat     pattern
So the group is defined as usual by parentheses (round brackets).
Next must come "?P" to indicate that we are handling a named group.
Then comes the name of the group in angle brackets.
Finally comes the pattern that actually does the matching. None of the ?P<…>
business is used for matching; it is purely for naming.
Using the named group
Line:
    RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.
result.group('jobnum')          '000001'
result.group('filename')        'hydrogen.dat'
To refer to a group by its name, you simply pass the name to the group() method as
a string. You can still also refer to the group by its number. So in the example here,
result.group('jobnum') is the same as result.group(1), since the first group
is named "jobnum".
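A sketch of named groups in use, based on the pattern and sample line from the slides. The closing call to groupdict(), a match object method that returns all the named groups as a dictionary, goes slightly beyond what the slides cover and is included only as a convenience.

```python
import re

pattern = r"""
    ^
    RUN\                                # Fixed text and a significant space
    (?P<jobnum>\d{6})                   # Job number
    \ COMPLETED\.\ OUTPUT\ IN\ FILE\    # Fixed text
    (?P<filename>[a-z]+\.dat)           # File name
    \.
    $
    """
regexp = re.compile(pattern, re.VERBOSE)

result = regexp.search("RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.\n")

by_name = result.group("jobnum")     # refer to the group by its name...
by_number = result.group(1)          # ...or, still, by its number
print(by_name, by_number)
print(result.groupdict())            # every named group at once
```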
Putting it all together – 1
…
pattern=r"""
^
RUN\ 
(?P<jobnum>\d{6})                       # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
(?P<filename>[a-z]+\.dat)               # File name
\.
$
"""
…
So if we edit our filter02.py script we can allocate group name "jobnum" to the
series of six digits and "filename" to the file name (complete with suffix ".dat"). This
is all done in the pattern string.
Putting it all together – 2
…
regexp = re.compile(pattern, re.VERBOSE)
for line in sys.stdin:
    result = regexp.search(line)
    if result:
        sys.stdout.write("%s\t%s\n" % (result.group('jobnum'), result.group('filename')))
At the bottom of the script we then modify the output line to use the names of the
groups in the write statement.
Parentheses in regular expressions
Numbered group          (…)
Named group             (?P<name>…)
So here's a new use of parentheses (without backslashes).
Exercise 9: use named groups
filter04.py:
    numbered groups  →  named groups
Now try it for yourselves. Adapt the filter04.py script to use named groups (with
meaningful group names). Make sure you test it to check it still works!
If you have any problems with this exercise, please ask the lecturer.
Ambiguous groups within
the regular expression
Dictionary:     /var/lib/dict/words
Reg. exp.:      ^([a-z]+)([a-z]+)$
Script:         filter05.py
What part of the word goes in group 1,
and what part goes in group 2?
Groups are all well and good, but are they necessarily well-defined? What happens if
a line can fit into groups in two different ways?
For example, consider the list of words in /var/lib/dict/words. The lower case
words of two or more letters in this file all match the regular expression "[a-z]+[a-z]+" because each is a
series of lower case letters followed by a series of lower case letters. But if we assign
groups to these parts,
"([a-z]+)([a-z]+)", which part of the word goes into the first group and which in
the second?
You can find out by running the script filter05.py.
"Greedy" expressions
^ ([a-z]+) ([a-z]+) $
$ python filter05.py
…
aardvar         k
aardvark        s
…
The first group is "greedy" at the expense of the second group.
Aim to avoid ambiguity
Python's implementation of regular expressions makes the first group "greedy"; the
first group swallows as many letters as it can at the expense of the second.
There is no guarantee that other languages' implementations will do the same,
though. You should always aim to avoid this sort of ambiguity.
You can change the greed of various groups with yet more use of the query character
but please note the ambiguity caution above. If you find yourself wanting to play with
the greediness you're almost certainly doing something wrong at a deeper level.
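The greedy behaviour is easy to see on a couple of words. This is a sketch using two sample words rather than the whole dictionary, and it is not the contents of filter05.py.

```python
import re

regexp = re.compile(r"^([a-z]+)([a-z]+)$")

splits = {}
for word in ["aardvark", "aardvarks"]:
    result = regexp.search(word)
    # The first group takes every letter it can; the second gets what is left.
    splits[word] = (result.group(1), result.group(2))
    print("%s\t%s" % splits[word])
```

The second group is left with the single letter it needs in order to match at all.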
Referring to numbered groups
within a regular expression
^([a-z]+)\1$
    ([a-z]+)    Matches a sequence of letters: group 1
    \1          Matches "the same as group 1"
Now that we have groups in our regular expression we can use them for more. So far
the bracketing to create groups has been purely labelling, to identify sections we can
extract later. Now we will use them in the expression itself.
We can use a backslash in front of a number (for integers from 1 to 99) to mean "that
number group in the current expression". So "^([a-z]+)\1$" matches any string
which is the same sequence of lower case letters repeated twice.
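A numbered back-reference can be tried on a handful of words. This is a sketch; the word list is made up, mixing words that are a doubled sequence with words that are not.

```python
import re

regexp = re.compile(r"^([a-z]+)\1$")   # raw string, so \1 reaches the re module intact

words = ["bonbon", "beriberi", "banana", "aa", "abc"]
doubled = [word for word in words if regexp.search(word)]
print(doubled)
```

Note the r qualifier on the string: without it Python would treat "\1" as a character escape before the re module ever saw it.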
Referring to named groups
within a regular expression
^
(?P<half>[a-z]+)        Creates a group, "half"
(?P=half)               Refers to the group; does not create a group
$
If we have given names to our groups, then we use the special Python syntax
"(?P=groupname)" to mean "the group groupname in the current expression". So
"^(?P<word>[a-z]+)(?P=word)$" matches any string which is the same
sequence of lower case letters repeated twice.
Note that in this case the (?…) expression does not create a group; instead, it refers
to one that already exists. Observe that there is no pattern language in that second
pair of parentheses.
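The named version can be sketched in the same way. The word list is made up for illustration, and this is not the contents of the course's own script.

```python
import re

pattern = r"""
    ^
    (?P<half>[a-z]+)   # Creates a group called half
    (?P=half)          # Refers back to whatever half matched
    $
    """
regexp = re.compile(pattern, re.VERBOSE)

halves = {}
for word in ["atlatl", "booboo", "python"]:
    result = regexp.search(word)
    if result:
        halves[word] = result.group("half")
print(halves)
```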
Example
$ python filter06.py
atlatl
baba
beriberi
bonbon
booboo
bulbul
…
The file filter06.py does precisely this using a named group.
I have no idea what half of these words mean.
Parentheses in regular expressions
Create a numbered group         (…)
Create a named group            (?P<name>…)
Refer to a named group          (?P=name)
This completes all the uses of parentheses in Python regular expressions that we are
going to meet in this course. There are more.
Exercise 10
filter06.py  →  filter07.py
Find all the words with the pattern ABABA
e.g. entente:
    A   B   A   B   A
    e   nt  e   nt  e
Copy the script filter06.py to filter07.py and edit the latter to find all the
words with the form ABABA. (Call your groups "a" and "b" if you are stuck for
meaningful names.)
Note that for the example word on the slide, the A pattern just happens to be one
letter long (the lower case letter "e"), whilst the B pattern is two letters long (the lower
case letter sequence "nt").
Hint: On PWF Linux the /var/lib/dict/words dictionary contains C such words.
No, I have no idea what most of them mean, either.
Multiple matches
Data: boil.txt
Basic entry:
    Ar 87.3
Different number of entries on each line:
    Ar 87.3
    Re 5900.0 Ra 2010.0
    K 1032.0 Rn 211.3 Rh 3968.0
Want to unpick this mess
Now we will move on to a more powerful use of groups. Consider the file boil.txt.
This contains the boiling points (in Kelvin at standard pressure) of the various
elements but it has lines with different numbers of entries on them. Some lines have a
single element/temperature pair, others have two, three, or four. We will presume that
we don't know what the maximum per line is.
What pattern do we need?
K 1032.0 Rn 211.3 Rh 3968.0
    [A-Z][a-z]?     "element"
    \s+             white space
    \d+\.\d+        "boil"
We need it multiple times
…but we don't know how many
The basic structure of each line is straightforward, so long as we can have an arbitrary
number of instances of a group.
Elementary pattern
K 1032.0
(?P<element>[A-Z][a-z]?)
\s+
(?P<boil>\d+\.\d+)
Matches a single pair
We start by building the basic pattern that will be repeated multiple times. The basic
pattern contains two groups which isolate the components we want from each repeat:
the name of the element and the temperature.
Note that because the pattern can occur anywhere in the line we don't use the "^/$"
pair.
Putting it all together – 1
…
pattern=r"""
(?P<element>[A-Z][a-z]?)
\s+
(?P<boil>\d+\.\d+)
"""
regexp = re.compile(pattern, re.VERBOSE)
…
We put all this together in a file called filter08.py.
Our pattern matches one of the element name/boiling point pairs and names the two
groups appropriately. Because we know we don't have a line per pair we aren't using
the ^…$ anchors.
Putting it all together – 2
…
for line in sys.stdin:
    result = regexp.search(line)
    if result:
        sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))
At the bottom of the script we print out whatever the two groups have matched.
Try that pattern
$ python filter08.py < boil.txt
Ar      87.3
Re      5900.0
K       1032.0
…
Ag      2435.0
Au      3129.0
First matching case of each line
But only the first
We will start by dropping this pattern into our standard script, mostly to see what
happens. The script does generate some output, but the pattern only matches against
the start of the line. It finishes as soon as it has matched once.
Multiple matches
regexp.search(line)             returns a single match
regexp.finditer(line)           returns a list of matches
It would be better called
searchiter() but never mind
The problem lies in our use of regexp's search() method. It returns a single
MatchObject, corresponding to that first instance of the pattern in the line.
The regular expression object has another method called "finditer()" which
returns a list of matches, one for each that it finds in the line. (It would be better called
"searchiter()" but never mind.)
(Actually, it doesn't return a list, but rather one of those Python objects that can be
treated like a list. They're called "iterators" which is where the name of the method
comes from.)
The original script
…
for line in sys.stdin:
    result = regexp.search(line)
    if result:
        sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))
So, we return to our script and observe that it currently uses search() to return a
single MatchObject and tests on that object.
The changed script
…
for line in sys.stdin:
    results = regexp.finditer(line)
    for result in results:
        sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))
The pattern remains exactly the same.
We change the line that called search() and stored a single MatchObject for a line
that calls finditer() and stores a list of MatchObjects.
Instead of the if statement we have a for statement to loop through all of the
MatchObjects in the list. (If none are found it's an empty list.)
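The difference between the two methods can be seen on a single line of data. This sketch uses one sample line of the kind shown for boil.txt rather than reading the file itself.

```python
import re

pattern = r"""
    (?P<element>[A-Z][a-z]?)   # Element symbol
    \s+                        # White space
    (?P<boil>\d+\.\d+)         # Boiling point
    """
regexp = re.compile(pattern, re.VERBOSE)

line = "K 1032.0 Rn 211.3 Rh 3968.0\n"

# search() stops at the first pair on the line.
first = regexp.search(line)
print(first.group("element"), first.group("boil"))

# finditer() yields one match object for every pair on the line.
pairs = [(result.group("element"), result.group("boil"))
         for result in regexp.finditer(line)]
print(pairs)
```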
Using finditer()
$ python filter09.py < boil.txt
Ar      87.3
Re      5900.0
Ra      2010.0
…
Au      3129.0
At      610.0
In      2345.0
Every matching
case in each line
And it works! This time we get all the element/temperature pairs.
Exercise 11
filter10.py
Edit the script so that text is split into one word
per line with no punctuation or spaces output.
$ python filter10.py < paragraph.txt
This
is
free
…
One last exercise in class. The file filter10.py that you have is a skeleton script
that needs lines completed. Edit the file so that it can be used to split incoming text
into individual words, printing one on each line. Punctuation should not be printed.
You may find it useful to recall the definition of "\w".
We have covered only
simple regular expressions!
Capable of much more!
We have focused on getting Python to use them.
UCS course:
Pattern Matching Using Regular Expressions
focuses on the expressions themselves and not on
the language using them.
And that's it!
However, let me remind you that this course has concentrated on getting regular
expressions to work in Python and has only introduced regular expression syntax
where necessary to illustrate features in Python's re module. Regular expressions
are capable of much, much more and the UCS offers a two afternoon course, "Pattern
Matching Using Regular Expressions", that covers them in full detail. For further
details of this course see the course description at:
http://training.csx.cam.ac.uk/course/regex
Final exercise
Data: atoms2.log
(extended log file carrying timing information)
Find the total CPU time taken for all:
1. successful runs
2. unsuccessful runs
We will end with a final exercise which you can either do in the class room or take
away to try at home. The file atoms2.log is similar to atoms.log except that the
lines have been extended to include CPU timing information. Write a script which
determines the total time taken for both the successful and unsuccessful runs and
which prints out both figures.
If you want to do this exercise outside of class, the atoms2.log file is available
on-line at this URL:
http://www-uxsup.csx.cam.ac.uk/courses/PythonRE/atoms2.log
So that you have some idea of whether or not your script is correct, here are the
total times for successful and unsuccessful runs:
Total CPU seconds taken for successful runs: 275.35
Total CPU seconds taken for unsuccessful runs: 806.19