COMP(2041|9044) 25T1 — Python Regular Expressions
Regular Expression History Revisited
1950s mathematician Stephen Kleene develops theory
1960s Ken Thompson develops syntax and practical implementation, two versions:
POSIX Basic Regular Expressions
limited syntax, e.g. no |
used by grep & sed
needed when computers were very slow, to make regex matching faster
POSIX Extended Regular Expressions - superset of Basic Regular Expressions
used by grep -E & sed -E
1980s Henry Spencer produces open source regex C library
used in many places, e.g. postgresql, tcl
extended Ken’s regex language with added features & syntax
1987 Perl (Larry Wall) copied Henry’s library & extended much further
available outside Perl via Perl Compatible Regular Expressions library
used by grep -P
1990s Python standard re package also copied Henry’s library
added most of the features in Perl/PCRE
many commonly used features are common to both
we will cover some (not all) useful extra regex features found in both Python & Perl/PCRE
note [Link] lets you specify which regex language
Python re package - useful functions
re.search(regex, string, flags)
# search for a *regex* match within *string*
# return object with information about match or `None` if match fails
# optional parameter flags modifies matching,
# e.g. make matching case-insensitive with: `flags=re.I`
re.match(regex, string, flags)
# only match at start of string
# same as `re.search` with a regex starting with `^`
re.fullmatch(regex, string, flags)
# only match the full string
# same as `re.search` with a regex starting with `^` and ending with `$`
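A minimal sketch of how the three functions differ on the same input. The sample string is made up for illustration.

```python
import re

s = "COMP2041 and COMP9044"

# re.search finds a match anywhere in the string
found = re.search(r"\d+", s)
print(found.group(0))                           # 2041

# re.match only succeeds if the match begins at the start of the string
print(re.match(r"\d+", s))                      # None: the string starts with 'C'
print(re.match(r"[A-Z]+", s).group(0))          # COMP

# re.fullmatch only succeeds if the regex matches the entire string
print(re.fullmatch(r"[A-Z]+\d+", s))            # None
print(re.fullmatch(r"[A-Z]+\d+", "COMP2041").group(0))  # COMP2041

# flags=re.I makes matching case-insensitive
print(re.search(r"comp", s, flags=re.I).group(0))       # COMP
```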
re.sub(regex, replacement, string, count, flags)
# return *string* with every match of *regex* substituted by *replacement*
# optional parameter *count*, if non-zero, sets maximum number of substitutions
re.findall(regex, string, flags)
# return all non-overlapping matches of regex in string
# if regex contains () return only the part matched by ()
# if regex contains multiple () return a tuple for each match
re.split(regex, string, maxsplit, flags)
# split *string* everywhere *regex* matches
# optional parameter *maxsplit*, if non-zero, sets maximum number of splits
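A short sketch of the optional count and maxsplit parameters, passed by keyword. The input string is an arbitrary example.

```python
import re

s = "a1b22c333d"

# count=0 (the default) replaces every match; count=2 only the first two
all_replaced = re.sub(r"\d+", "#", s)
first_two = re.sub(r"\d+", "#", s, count=2)
print(all_replaced)   # a#b#c#d
print(first_two)      # a#b#c333d

# maxsplit=1 splits only at the first match
first_split = re.split(r"\d+", s, maxsplit=1)
print(first_split)    # ['a', 'b22c333d']
```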
Python Character Classes (also in PCRE)
\d matches any digit, for ASCII: [0-9]
\D matches any non-digit, for ASCII: [^0-9]
\w matches any word char, for ASCII: [a-zA-Z_0-9]
\W matches any non-word char, for ASCII: [^a-zA-Z_0-9]
\s matches any whitespace, for ASCII: [ \t\n\r\f\v]
\S matches any non-whitespace, for ASCII: [^ \t\n\r\f\v]
\b matches at a word boundary
\B matches except at a word boundary
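\A matches at the start of the string, same as ^ (unless re.MULTILINE is set)
\Z matches only at the very end of the string; $ also matches just before a trailing newline
the character classes are convenient and make your regex more likely to be portable to non-English locales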
\b and \B are like ^ and $ - they don’t match characters, they anchor the match
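A small sketch of \b, \B and \w in use. The sample text is invented for illustration.

```python
import re

text = "cat concat catalog cat."

# without anchors, 'cat' also matches inside longer words
anywhere = re.findall(r"cat", text)
print(len(anywhere))                    # 4

# \b restricts matches to whole words
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)                      # ['cat', 'cat']

# \B is the opposite: 'cat' only where it does not start at a word boundary
inside = re.findall(r"\Bcat", text)
print(inside)                           # ['cat']  (from 'concat')

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", "x_1 = 42;")
print(words)                            # ['x_1', '42']
```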
Raw Strings
a Python raw-string is prefixed with an r (for raw)
the r prefix works with strings quoted with ' " ''' """
backslashes have no special meaning in raw-string except before quotes
backslashes escape quotes but also stay in the string
regexes often contain backslashes - using raw-strings makes them more readable
>>> print('Hello\nAndrew')
Hello
Andrew
>>> print(r'Hello\nAndrew')
Hello\nAndrew
>>> r'Hello\nAndrew' == 'Hello\\nAndrew'
True
>>> len('\n')
1
>>> len(r'\n')
2
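A sketch of why raw-strings matter for regexes in particular: in an ordinary string '\b' is the backspace character, so a word-boundary regex written without the r prefix silently matches nothing.

```python
import re

# '\b' is one character (backspace); r'\b' is two characters (backslash, b)
assert len('\b') == 1 and len(r'\b') == 2

plain = '\\bcat\\b'      # backslashes must be doubled in an ordinary string
raw = r'\bcat\b'         # raw-string: written exactly as the regex engine sees it
print(plain == raw)      # True

with_raw = re.findall(raw, "cat concat")
without_raw = re.findall('\bcat\b', "cat concat")   # searches for backspace characters
print(with_raw)          # ['cat']
print(without_raw)       # []
```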
Match objects
re.search, re.match and re.fullmatch return a match object if a match succeeds, None if it fails
hence their return value can be used to control if or while
print("Destroy the file system? ")
answer = input()
if re.search(r'yes|ok|affirmative', answer, flags=re.I):
    subprocess.run("rm -r /", shell=True)
the match object can provide useful information:
>>> m = re.search(r'[aiou].*[aeiou]', 'pillow')
>>> m
<re.Match object; span=(1, 5), match='illo'>
>>> m.group(0)
'illo'
>>> m.span()
(1, 5)
Capturing Parts of a Regex Match
brackets are used for grouping (like arithmetic) in extended regular expressions
in Python (& PCRE) brackets also capture the part of the string matched
group(n) returns part of the string matched by the nth-pair of brackets
>>> m = re.search(r'(\w+)\s+(\w+)', 'Hello Andrew')
>>> m.groups()
('Hello', 'Andrew')
>>> m.group(1)
'Hello'
>>> m.group(2)
'Andrew'
\number can be used to refer to a group number in an re.sub replacement string
>>> re.sub(r'(\d+) and (\d+)', r'\2 or \1', "The answer is 42 and 43?")
'The answer is 43 or 42?'
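A further sketch of capturing, assuming an ISO-style date as the input (the date and surrounding text are made up). It shows groups() unpacking and \number in a replacement string.

```python
import re

date = "Lecture on 2025-03-14"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

m = re.search(pattern, date)
if m:
    year, month, day = m.groups()   # one string per pair of brackets
    print(day, month, year)         # 14 03 2025
    print(m.group(0))               # 2025-03-14  (group 0 is the whole match)

# \1 \2 \3 in the replacement refer to the captured groups
reordered = re.sub(pattern, r"\3/\2/\1", date)
print(reordered)                    # Lecture on 14/03/2025
```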
Back-referencing
\number can also be used in a regex
this is usually called a back-reference
e.g. r'^(\d+) (\1)$' matches the same integer twice
>>> re.search(r'^(\d+) (\d+)$', '42 43')
<re.Match object; span=(0, 5), match='42 43'>
>>> re.search(r'^(\d+) (\1)$', '42 43')
>>> re.search(r'^(\d+) (\1)$', '42 42')
<re.Match object; span=(0, 5), match='42 42'>
back-references allow matching that is impossible with classical regular expressions
Python supports up to 99 back-references, \1, \2, \3, …, \99
\01 or \100 is interpreted as an octal escape, not a back-reference
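A sketch of two common uses of back-references: finding accidentally doubled words, and matching a string that is some word repeated twice (a language no classical regular expression can describe). The sample sentences are invented.

```python
import re

text = "this is is a test of the the back-reference"

# \1 must match exactly the text captured by the first pair of brackets
doubled = re.findall(r"\b(\w+)\s+\1\b", text)
print(doubled)      # ['is', 'the']

# replace each doubled word by a single copy
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)
print(fixed)        # this is a test of the back-reference

# a string consisting of the same sequence twice
print(bool(re.fullmatch(r"(\w+)\1", "abcabc")))   # True
print(bool(re.fullmatch(r"(\w+)\1", "abcabd")))   # False
```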
Non-Capturing Group
(?:...) is a non-capturing group
it has the same grouping behaviour as (...)
it doesn’t capture the part of the string matched by the group
>>> m = re.search(r'.*(?:[aeiou]).*([aeiou]).*', 'abcde')
>>> m
<re.Match object; span=(0, 5), match='abcde'>
>>> m.group(1)
'e'
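The difference is easiest to see with re.findall, which returns captured text when the regex contains capturing brackets. A minimal sketch with a made-up input string:

```python
import re

s = "2041 or 9044"

# capturing group: findall returns only the captured part of each match
captured = re.findall(r"(20|90)\d\d", s)
print(captured)        # ['20', '90']

# non-capturing group: same grouping, but findall returns each whole match
whole = re.findall(r"(?:20|90)\d\d", s)
print(whole)           # ['2041', '9044']
```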
Greedy versus non-Greedy Pattern Matching
The default semantics for pattern matching is greedy:
start the match at the first place it can succeed
make the match as long as possible
A ? after a quantifier (e.g. +? or *?) changes it to non-greedy:
start the match at the first place it can succeed
make the match as short as possible
>>> s = "abbbc"
>>> re.sub(r'ab+', 'X', s)
'Xc'
>>> re.sub(r'ab+?', 'X', s)
'Xbbc'
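A sketch of a case where the difference matters in practice: extracting tags from a line of markup. The markup string is a made-up example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# greedy: .* runs to the last > on the line, giving one long match
greedy = re.findall(r"<.*>", html)
print(greedy)       # ['<b>bold</b> and <i>italic</i>']

# non-greedy: .*? stops at the first > it can, giving one match per tag
non_greedy = re.findall(r"<.*?>", html)
print(non_greedy)   # ['<b>', '</b>', '<i>', '</i>']
```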
Why Implementing Regex Matching isn’t Easy
regex matching starts the match at the first place it can succeed
but a regex can partly match in many places
>>> re.sub(r'ab+c', 'X', "abbabbbbbbbabbbc")
'abbabbbbbbbX'
and may need to backtrack, e.g:
>>> re.sub(r'a.*bc', 'X', "abbabbbbbbbcabbb")
'Xabbb'
poorly designed regex engines can get very slow
slow regexes have been used for denial-of-service attacks
Python (& PCRE) regex matching is NP-hard due to back-references
re.findall
re.findall returns a list of the matched strings, e.g:
>>> re.findall(r'\d+', "-5==10zzz200_")
['5', '10', '200']
if the regex contains () only the captured text is returned
>>> re.findall(r'(\d)\d*', "-5==10zzz200_")
['5', '1', '2']
if the regex contains multiple () a list of tuples is returned
>>> re.findall(r'(\d)\d*(\d)', "-5==10zzz200_")
[('1', '0'), ('2', '0')]
>>> re.findall(r'([^,]*), (\S+)', "Hopper, Grace Brewster Murray")
[('Hopper', 'Grace')]
>>> re.findall(r'([A-Z])([aeiou])', "Hopper, Grace Brewster Murray")
[('H', 'o'), ('M', 'u')]
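Because a regex with two pairs of brackets yields a list of 2-tuples, the result can be passed straight to dict. A sketch with an invented key=value line:

```python
import re

line = "name=Grace, year=1906, language=COBOL"

# each match contributes one (key, value) tuple
pairs = re.findall(r"(\w+)=(\w+)", line)
print(pairs)    # [('name', 'Grace'), ('year', '1906'), ('language', 'COBOL')]

fields = dict(pairs)
print(fields["language"])   # COBOL
```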
re.split
re.split splits a string where a regex matches
>>> re.split(r'\d+', "-5==10zzz200_")
['-', '==', 'zzz', '_']
like cut in Shell scripts - but more powerful
for example, you can’t do this with cut
>>> re.split(r'\s*,\s*', "abc,de, ghi ,jk , mn")
['abc', 'de', 'ghi', 'jk', 'mn']
see also the string join function
>>> a = re.split(r'\s*,\s*', "abc,de, ghi ,jk , mn")
>>> a
['abc', 'de', 'ghi', 'jk', 'mn']
>>> ':'.join(a)
'abc:de:ghi:jk:mn'
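Two details of re.split worth knowing, sketched on small made-up inputs: brackets in the regex keep the separators in the result, and a separator at the start or end of the string produces empty strings.

```python
import re

s = "-5==10zzz200_"

# with () around the regex, the text matched by the separator is also returned
with_separators = re.split(r"(\d+)", s)
print(with_separators)   # ['-', '5', '==', '10', 'zzz', '200', '_']

# a separator at either end of the string yields empty strings
parts = re.split(r",", ",a,")
print(parts)             # ['', 'a', '']

# filter them out if they are not wanted
fields = [p for p in parts if p]
print(fields)            # ['a']
```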
Example - printing the last number
# Print the last number (real or integer) on every line
# Note: regexp to match number: -?\d+\.?\d*
# Note: use of assignment operator :=
import re, sys
for line in sys.stdin:
    if m := re.search(r'(-?\d+\.?\d*)\D*$', line):
        print(m.group(1))
source code for print_last_number.py
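The same regex can be wrapped in a function so it can be tried on strings instead of stdin. This is a sketch; the function name last_number and the sample lines are hypothetical, not part of the original program.

```python
import re

def last_number(line):
    """Return the last number on line as a string, or None if there is none."""
    # \D*$ requires that only non-digits follow the captured number
    m = re.search(r"(-?\d+\.?\d*)\D*$", line)
    return m.group(1) if m else None

print(last_number("temp 12 then -3.5 degrees"))   # -3.5
print(last_number("1 2 3\n"))                     # 3
print(last_number("no digits here"))              # None
```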
Example - finding numbers #0
# print the sum and mean of any positive integers found on stdin
# Note regexp to split on non-digits
# Note check to handle empty string from split
# Only positive integers handled
import re, sys
input_as_string = sys.stdin.read()
numbers = re.split(r"\D+", input_as_string)
total = 0
n = 0
for number in numbers:
    if number:
        total += int(number)
        n += 1
# test n, not numbers: re.split never returns an empty list,
# so testing numbers would divide by zero when the input has no digits
if n:
    print(f"{n} numbers, total {total}, mean {total / n:.1f}")
source code for find_numbers.[Link]
Example - finding numbers #1
# print the sum and mean of any numbers found on stdin
# Note regexp to match number -?\d+\.?\d*
# match positive & negative integers & floating-point numbers
import re, sys
input_as_string = sys.stdin.read()
numbers = re.findall(r"-?\d+\.?\d*", input_as_string)
n = len(numbers)
total = sum(float(number) for number in numbers)
if numbers:
    print(f"{n} numbers, total {total}, mean {total / n:.1f}")
source code for find_numbers.[Link]
Example - counting enrollments with regexes & dicts
import re
# map course code to course name
course_names = {}
with open(COURSE_CODES_FILE, encoding="utf-8") as f:
    for line in f:
        if m := re.match(r"(\S+)\s+(.*\S)", line):
            course_names[m.group(1)] = m.group(2)
# count the enrollment lines for each course code
enrollments_count = {}
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        # delete everything from the first | onwards, leaving the course code
        course_code = re.sub(r"\|.*\n", "", line)
        if course_code not in enrollments_count:
            enrollments_count[course_code] = 0
        enrollments_count[course_code] += 1
for (course_code, enrollment) in sorted(enrollments_count.items()):
    # if no name for course_code use ???
    name = course_names.get(course_code, "???")
    print(f"{enrollment:4} {course_code} {name}")
source code for count_enrollments.[Link]
Example - counting enrollments with split & counters
import collections
course_names = {}
with open(COURSE_CODES_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, course_name = line.strip().split("\t", maxsplit=1)
        course_names[course_code] = course_name
# a Counter returns 0 for missing keys, so no initialisation is needed
enrollments_count = collections.Counter()
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code = line.split("|")[0]
        enrollments_count[course_code] += 1
for (course_code, enrollment) in sorted(enrollments_count.items()):
    # if no name for course_code use ???
    name = course_names.get(course_code, "???")
    print(f"{enrollment:4} {course_code} {name}")
source code for count_enrollments.[Link]
Example - counting first names
import collections, re
already_counted = set()
first_name_count = collections.Counter()
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        _, student_number, full_name = line.split("|")[0:3]
        # a student enrolled in several courses is only counted once
        if student_number in already_counted:
            continue
        already_counted.add(student_number)
        if m := re.match(r".*,\s+(\S+)", full_name):
            first_name = m.group(1)
            first_name_count[first_name] += 1
# put the count first in the tuples so sorting orders on count before name
count_name_tuples = [(c, f) for (f, c) in first_name_count.items()]
# print first names in decreasing order of popularity
for (count, first_name) in sorted(count_name_tuples, reverse=True):
    print(f"{count:4} {first_name}")
source code for count_first_names.py
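An alternative to building reversed tuples is a key function passed to sorted. The sketch below uses a small hand-made list of names as an assumption; note that the two approaches order ties differently.

```python
import collections

first_name_count = collections.Counter(
    ["Ann", "Bob", "Ann", "Cy", "Bob", "Ann", "Al"]
)

# tuples approach: reverse=True also reverses the alphabetical order of ties
count_name_tuples = [(c, f) for (f, c) in first_name_count.items()]
by_tuples = sorted(count_name_tuples, reverse=True)
print(by_tuples)   # [(3, 'Ann'), (2, 'Bob'), (1, 'Cy'), (1, 'Al')]

# key function: decreasing count, ties in ascending alphabetical order
by_key = sorted(first_name_count.items(), key=lambda item: (-item[1], item[0]))
print(by_key)      # [('Ann', 3), ('Bob', 2), ('Al', 1), ('Cy', 1)]
```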
Example - finding duplicate first names using dict of dicts
import re, sys
# dict of dicts: course code -> first name -> count
course_first_name_count = {}
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, _, full_name = line.split("|")[0:3]
        if m := re.match(r".*,\s+(\S+)", full_name):
            first_name = m.group(1)
        else:
            print("Warning could not parse line", line.strip(), file=sys.stderr)
            continue
        if course_code not in course_first_name_count:
            course_first_name_count[course_code] = {}
        if first_name not in course_first_name_count[course_code]:
            course_first_name_count[course_code][first_name] = 0
        course_first_name_count[course_code][first_name] += 1
for course in sorted(course_first_name_count.keys()):
    for (first_name, count) in course_first_name_count[course].items():
        if count >= REPORT_MORE_THAN_STUDENTS:
            print(course, "has", count, "students named", first_name)
source code for duplicate_first_names.[Link]
Example - finding duplicate first names using split & defaultdict of counters
import collections
# missing course codes get an empty Counter, missing names a count of 0
course_first_name_count = collections.defaultdict(collections.Counter)
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, _, full_name = line.split("|")[0:3]
        given_names = full_name.split(",")[1].strip()
        first_name = given_names.split(" ")[0]
        course_first_name_count[course_code][first_name] += 1
for (course, name_counts) in sorted(course_first_name_count.items()):
    for (first_name, count) in name_counts.items():
        if count > REPORT_MORE_THAN_STUDENTS:
            print(course, "has", count, "students named", first_name)
source code for duplicate_first_names.[Link]
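A minimal sketch of why a defaultdict of Counters removes the "if key not in dict" checks of the dict-of-dicts version. The course codes and names below are made-up sample data.

```python
import collections

counts = collections.defaultdict(collections.Counter)

sample = [("COMP2041", "Ann"), ("COMP2041", "Ann"), ("COMP9044", "Bob")]
for course_code, first_name in sample:
    # the defaultdict creates an empty Counter the first time a course is seen,
    # and the Counter treats a missing name as 0
    counts[course_code][first_name] += 1

print(counts["COMP2041"]["Ann"])   # 2
print(counts["COMP9044"]["Ann"])   # 0
print(sorted(counts.keys()))       # ['COMP2041', 'COMP9044']
```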
Example - Changing Filenames with Regex
# written by andrewt@[Link] for COMP(2041|9044)
#
# Change the names of the specified files
# by substituting occurrences of regex with replacement
# (simple version of the perl utility rename)
import os
import re
import sys
if len(sys.argv) < 3:
    print(f"Usage: {sys.argv[0]} <regex> <replacement> [files]", file=sys.stderr)
    sys.exit(1)
regex = sys.argv[1]
replacement = sys.argv[2]
for old_pathname in sys.argv[3:]:
    new_pathname = re.sub(regex, replacement, old_pathname, count=1)
    # nothing to do if the regex did not change the name
    if new_pathname == old_pathname:
        continue
    # never overwrite an existing file
    if os.path.exists(new_pathname):
        print(f"{sys.argv[0]}: '{new_pathname}' exists", file=sys.stderr)
        continue
    try:
        os.rename(old_pathname, new_pathname)
    except OSError as e:
        print(f"{sys.argv[0]}: '{new_pathname}' {e}", file=sys.stderr)
source code for rename_regex.py
Example - Changing Filenames with Regex & Eval
# written by andrewt@[Link] for COMP(2041|9044)
#
# Change the names of the specified files
# by substituting occurrences of regex with replacement
# (simple version of the perl utility rename)
#
# also demonstrating argument processing and use of eval
# beware eval can allow arbitrary code execution,
# it should not be used where security is important
import argparse
import os
import re
import sys
parser = argparse.ArgumentParser()
# add required arguments
parser.add_argument("regex", type=str, help="match against filenames")
parser.add_argument("replacement", type=str, help="replaces matches with this")
parser.add_argument("filenames", nargs="*", help="filenames to be changed")
# add some optional boolean arguments
parser.add_argument(
    "-d", "--dryrun", action="store_true", help="show changes but don't make them"
)
parser.add_argument(
    "-v", "--verbose", action="store_true", help="print more information"
)
parser.add_argument(
    "-e",
    "--eval",
    action="store_true",
    help="evaluate replacement as python expression, match available as _",
)
# optional integer argument which defaults to 1
parser.add_argument(
    "-n",
    "--replace_n_matches",
    type=int,
    default=1,
    help="replace n matches (0 for all matches)",
)
args = parser.parse_args()
def eval_replacement(match):
    """if --eval given, evaluate replacement string as Python
    with the variable _ set to the matching part of the filename
    """
    if not args.eval:
        return args.replacement
    _ = match.group(0)
    return str(eval(args.replacement))
for old_pathname in args.filenames:
    try:
        new_pathname = re.sub(
            args.regex, eval_replacement, old_pathname, count=args.replace_n_matches
        )
    except OSError as e:
        print(
            f"{sys.argv[0]}: '{old_pathname}': '{args.replacement}' {e}",
            file=sys.stderr,
        )
        continue
    if new_pathname == old_pathname:
        if args.verbose:
            print("no change:", old_pathname)
        continue
    if os.path.exists(new_pathname):
        print(f"{sys.argv[0]}: '{new_pathname}' exists", file=sys.stderr)
        continue
    if args.dryrun:
        print(old_pathname, "would be renamed to", new_pathname)
        continue
    if args.verbose:
        print("renaming", old_pathname, "to", new_pathname)
    try:
        os.rename(old_pathname, new_pathname)
    except OSError as e:
        print(f"{sys.argv[0]}: '{new_pathname}' {e}", file=sys.stderr)
source code for rename_regex_eval.py
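The program above relies on re.sub accepting a function as its replacement argument: the function is called once per match with the match object and returns the replacement text. A small sketch without eval; the helper name upper_match and the filenames are hypothetical.

```python
import re

def upper_match(match):
    """Called by re.sub for each match; returns the replacement text."""
    return match.group(0).upper()

# upper-case the first run of lower-case letters in a filename
renamed = re.sub(r"[a-z]+", upper_match, "lecture01.txt", count=1)
print(renamed)       # LECTURE01.txt

# a lambda works too: add one to the first number in a filename
incremented = re.sub(
    r"\d+", lambda m: str(int(m.group(0)) + 1), "week09.txt", count=1
)
print(incremented)   # week10.txt
```

When the transformation is known in advance, a replacement function like this avoids the arbitrary code execution risk of eval.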
Example - When Harry Met Hermione #0
# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
import re, sys, os
for filename in sys.argv[1:]:
    tmp_filename = filename + ".new"
    if os.path.exists(tmp_filename):
        print(f"{sys.argv[0]}: {tmp_filename} already exists\n", file=sys.stderr)
        sys.exit(1)
    with open(filename) as f:
        with open(tmp_filename, "w") as g:
            for line in f:
                # Zaphod is a placeholder so the two swaps don't interfere
                changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
                changed_line = changed_line.replace("Harry", "Hermione")
                changed_line = changed_line.replace("Zaphod", "Harry")
                g.write(changed_line)
    os.rename(tmp_filename, filename)
source code for change_names.[Link]
Example - When Harry Met Hermione #1
# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
# modified text is written to a temporary file which then replaces the original
import re, sys, os, shutil, tempfile
for filename in sys.argv[1:]:
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp:
        with open(filename) as f:
            for line in f:
                changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
                changed_line = changed_line.replace("Harry", "Hermione")
                changed_line = changed_line.replace("Zaphod", "Harry")
                tmp.write(changed_line)
    shutil.move(tmp.name, filename)
source code for change_names.[Link]
Example - When Harry Met Hermione #2
# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
# modified text is stored in a list then file over-written
import re, sys, os
for filename in sys.argv[1:]:
    changed_lines = []
    with open(filename) as f:
        for line in f:
            changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
            changed_line = changed_line.replace("Harry", "Hermione")
            changed_line = changed_line.replace("Zaphod", "Harry")
            changed_lines.append(changed_line)
    with open(filename, "w") as g:
        g.write("".join(changed_lines))
source code for change_names.[Link]
Example - When Harry Met Hermione #3
# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
# modified text is stored in a single string then file over-written
import re, sys, os
for filename in sys.argv[1:]:
    with open(filename) as f:
        text = f.read()
    changed_text = re.sub(r"Herm[io]+ne", "Zaphod", text)
    changed_text = changed_text.replace("Harry", "Hermione")
    changed_text = changed_text.replace("Zaphod", "Harry")
    with open(filename, "w") as g:
        g.write(changed_text)
source code for change_names.[Link]
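The placeholder can be avoided by swapping both names in a single pass with a replacement function. This is a sketch of an alternative, not one of the course's versions; swap_names is a hypothetical helper name.

```python
import re

def swap_names(text):
    """Swap Harry and (possibly misspelt) Hermione in one pass over text."""
    def replacement(match):
        # decide the replacement from what was actually matched
        return "Hermione" if match.group(0) == "Harry" else "Harry"
    return re.sub(r"Harry|Herm[io]+ne", replacement, text)

swapped = swap_names("Harry met Hermoine and Hermione")
print(swapped)   # Hermione met Harry and Harry
```

Because each match is replaced exactly once, this version no longer relies on Zaphod not occurring in the text.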