COMP2041 25T1: Python Regex Guide

Here are regex class notes from my school

Uploaded by

felixbakitsi
COMP(2041|9044) 25T1 — Python Regular Expressions


Regular Expression History Revisited


1950s mathematician Stephen Kleene develops theory
1960s Ken Thompson develops syntax and practical implementation, two versions:
    POSIX Basic Regular Expressions
        limited syntax, e.g. no |
        used by grep & sed
        needed when computers were very slow, to make regex matching faster
    POSIX Extended Regular Expressions - superset of Basic Regular Expressions
        used by grep -E & sed -E
1980s Henry Spencer produces open source regex C library
    used in many places, e.g. postgresql, tcl
    extended (added features & syntax to) Ken’s regex language
1987 Perl (Larry Wall) copied Henry’s library & extended it much further
    available outside Perl via Perl Compatible Regular Expressions library
    used by grep -P
1990s Python standard re package also copied Henry’s library
    added most of the features in Perl/PCRE
    many commonly used features are common to both
we will cover some (not all) useful extra regex features found in both Python & Perl/PCRE
note [Link] lets you specify which regex language


Python re package - useful functions

re.search(regex, string, flags)

# search for a *regex* match within *string*
# return object with information about match or `None` if match fails
# optional parameter flags modifies matching,
# e.g. make matching case-insensitive with: `flags=re.I`

re.match(regex, string, flags)

# only match at start of string
# same as `re.search` with the regex starting with `^`

re.fullmatch(regex, string, flags)

# only match the full string
# same as `re.search` with the regex starting with `^` and ending with `$`
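
The difference between the three functions is easiest to see by running the same regex through each of them. The sketch below is only an illustration; which_match is a made-up helper name, not part of the re package.

```python
import re

def which_match(regex, string):
    """Hypothetical helper: names of the re functions that find a match."""
    found = []
    if re.search(regex, string):      # match anywhere in the string
        found.append("search")
    if re.match(regex, string):       # match must begin at the start
        found.append("match")
    if re.fullmatch(regex, string):   # match must cover the whole string
        found.append("fullmatch")
    return found

print(which_match(r"\d+", "abc 42"))  # ['search']
print(which_match(r"\d+", "42 abc"))  # ['search', 'match']
print(which_match(r"\d+", "42"))      # ['search', 'match', 'fullmatch']
```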



Python re package - useful functions

re.sub(regex, replacement, string, count, flags)

# return *string* with every place *regex* matches replaced by *replacement*
# optional parameter *count*, if non-zero, sets maximum number of substitutions

re.findall(regex, string, flags)

# return all non-overlapping matches of pattern in string
# if pattern contains () return part matched by ()
# if pattern contains multiple () return a tuple for each match

re.split(regex, string, maxsplit, flags)

# split *string* everywhere *regex* matches
# optional parameter *maxsplit*, if non-zero, sets maximum number of splits
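
A small sketch of the optional count and maxsplit parameters, using a made-up sample string. The parameters are passed by keyword here so they cannot be confused with flags.

```python
import re

s = "a1b22c333d"   # sample string chosen for this illustration

all_replaced = re.sub(r"\d+", "#", s)            # replace every match
two_replaced = re.sub(r"\d+", "#", s, count=2)   # replace only the first two matches
numbers = re.findall(r"\d+", s)                  # list of the matched strings
all_pieces = re.split(r"\d+", s)                 # split at every match
two_pieces = re.split(r"\d+", s, maxsplit=1)     # split at the first match only

print(all_replaced)   # a#b#c#d
print(two_replaced)   # a#b#c333d
print(numbers)        # ['1', '22', '333']
print(all_pieces)     # ['a', 'b', 'c', 'd']
print(two_pieces)     # ['a', 'b22c333d']
```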


Python Character Classes (also in PCRE)

\d matches any digit, for ASCII: [0-9]
\D matches any non-digit, for ASCII: [^0-9]
\w matches any word char, for ASCII: [a-zA-Z_0-9]
\W matches any non-word char, for ASCII: [^a-zA-Z_0-9]
\s matches any whitespace, for ASCII: [ \t\n\r\f\v]
\S matches any non-whitespace, for ASCII: [^ \t\n\r\f\v]
\b matches at a word boundary
\B matches except at a word boundary
\A matches at the start of the string, same as ^ (unless re.MULTILINE is set)
\Z in Python matches only at the very end of the string; $ also matches just before a trailing newline

convenient and make your regex more likely to be portable to non-English locales
\b and \B are like ^ and $ - they don’t match characters, they anchor the match
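
The short sketch below illustrates three of the points above with made-up sample strings: \b anchoring at word boundaries, \d matching non-ASCII digits unless re.ASCII is given, and the difference between $ and \Z before a trailing newline.

```python
import re

text = "cat concatenate cat."
anywhere = re.findall(r"cat", text)          # also finds the cat inside concatenate
whole_words = re.findall(r"\bcat\b", text)   # only cat as a separate word

digits = "\u0664\u0662"   # the digits 4 and 2 written in Arabic-Indic script
unicode_match = re.fullmatch(r"\d+", digits) is not None
ascii_match = re.fullmatch(r"\d+", digits, flags=re.ASCII) is not None

line = "the end\n"
dollar_match = re.search(r"end$", line) is not None    # $ matches before the final newline
big_z_match = re.search(r"end\Z", line) is not None    # \Z only matches at the very end

print(len(anywhere), len(whole_words))   # 3 2
print(unicode_match, ascii_match)        # True False
print(dollar_match, big_z_match)         # True False
```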


raw strings

a Python raw-string is prefixed with an r (for raw)
strings quoted with ' " ''' """ can all be prefixed with r
backslashes have no special meaning in a raw-string except before quotes
    a backslash escapes a quote but also stays in the string
regexes often contain backslashes - using raw-strings makes them more readable

>>> print('Hello\nAndrew')
Hello
Andrew
>>> print(r'Hello\nAndrew')
Hello\nAndrew
>>> r'Hello\nAndrew' == 'Hello\\nAndrew'
True
>>> len('\n')
1
>>> len(r'\n')
2



Match objects

re.search, re.match and re.fullmatch return a match object if a match succeeds, None if it fails
hence their return value can be used to control an if or while

# joke example, do not run: the command tries to delete every file under /
import re, subprocess
print("Destroy the file system? ")
answer = input()
if re.search(r'yes|ok|affirmative', answer, flags=re.I):
    subprocess.run("rm -r /", shell=True)

the match object can provide useful information:

>>> m = re.search(r'[aiou].*[aeiou]', 'pillow')
>>> m
<re.Match object; span=(1, 5), match='illo'>
>>> m.group(0)
'illo'
>>> m.span()
(1, 5)
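
Because a failed match returns None, code that uses a match object normally tests it first. A minimal sketch, where describe is a hypothetical helper name used only for this illustration:

```python
import re

def describe(regex, string):
    """Hypothetical helper: summarise the first match of regex in string."""
    m = re.search(regex, string)
    if m is None:                 # no match: there is no match object to query
        return "no match"
    # group(0) is the matched text, start() and end() are the two halves of span()
    return f"{m.group(0)!r} at {m.start()}..{m.end()}"

print(describe(r'[aiou].*[aeiou]', 'pillow'))   # 'illo' at 1..5
print(describe(r'[aiou].*[aeiou]', 'rhythm'))   # no match
```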


Capturing Parts of a Regex Match

brackets are used for grouping (like arithmetic) in extended regular expressions
in Python (& PCRE) brackets also capture the part of the string matched
group(n) returns the part of the string matched by the nth pair of brackets

>>> m = re.search(r'(\w+)\s+(\w+)', 'Hello Andrew')
>>> m.groups()
('Hello', 'Andrew')
>>> m.group(1)
'Hello'
>>> m.group(2)
'Andrew'

\number can be used to refer to a group number in an re.sub replacement string

>>> re.sub(r'(\d+) and (\d+)', r'\2 or \1', "The answer is 42 and 43?")
'The answer is 43 or 42?'


Back-referencing

\number can also be used in a regex
usually called a back-reference
e.g. r'^(\d+) (\1)$' matches the same integer twice

>>> re.search(r'^(\d+) (\d+)$', '42 43')
<re.Match object; span=(0, 5), match='42 43'>
>>> re.search(r'^(\d+) (\1)$', '42 43')
>>> re.search(r'^(\d+) (\1)$', '42 42')
<re.Match object; span=(0, 5), match='42 42'>

back-references allow matching impossible with classical regular expressions

Python supports up to 99 back-references, \1, \2, \3, …, \99

\01 or \100 is interpreted as an octal number
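
One common use of a back-reference is spotting accidentally doubled words. The sketch below is an illustration only; the names DOUBLED, doubled_words and remove_doubled are made up for it.

```python
import re

# a word, whitespace, then the same word again (\1 refers to the first group)
DOUBLED = r"\b(\w+)\s+\1\b"

def doubled_words(text):
    """Return each word that appears twice in a row."""
    return re.findall(DOUBLED, text)

def remove_doubled(text):
    """Replace each doubled word with a single copy."""
    return re.sub(DOUBLED, r"\1", text)

sentence = "it was the the best of of times"
print(doubled_words(sentence))    # ['the', 'of']
print(remove_doubled(sentence))   # it was the best of times
```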



Non-Capturing Group

(?:...) is a non-capturing group
    it has the same grouping behaviour as (...)
    it doesn’t capture the part of the string matched by the group

>>> m = re.search(r'.*(?:[aeiou]).*([aeiou]).*', 'abcde')
>>> m
<re.Match object; span=(0, 5), match='abcde'>
>>> m.group(1)
'e'
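
Non-capturing groups are handy when brackets are needed only for alternation and should not change what re.findall returns. A small sketch with a made-up sentence:

```python
import re

text = "Dr. Hopper met Ms. Lovelace"

# capturing the title: findall returns a tuple per match
with_titles = re.findall(r"(Mr|Ms|Dr)\. (\w+)", text)

# (?:...) still groups the alternation but captures nothing,
# so findall returns only the surname
surnames = re.findall(r"(?:Mr|Ms|Dr)\. (\w+)", text)

print(with_titles)   # [('Dr', 'Hopper'), ('Ms', 'Lovelace')]
print(surnames)      # ['Hopper', 'Lovelace']
```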


Greedy versus non-Greedy Pattern Matching

The default semantics for pattern matching is greedy:
    start the match at the first place it can succeed
    make the match as long as possible
A ? after a repetition operator (*?, +?, ??) changes it to non-greedy:
    start the match at the first place it can succeed
    make the match as short as possible

>>> s = "abbbc"
>>> re.sub(r'ab+', 'X', s)
'Xc'
>>> re.sub(r'ab+?', 'X', s)
'Xbbc'
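
The difference matters most when a pattern has a closing delimiter that can occur more than once. A sketch with a made-up string containing two quoted phrases:

```python
import re

s = 'say "hello" and "goodbye"'

# greedy: .* runs to the last quote, giving one long match
greedy = re.findall(r'".*"', s)

# non-greedy: .*? stops at the nearest closing quote
non_greedy = re.findall(r'".*?"', s)

print(greedy)       # ['"hello" and "goodbye"']
print(non_greedy)   # ['"hello"', '"goodbye"']
```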


Why Implementing Regex Matching isn’t Easy

regex matching starts the match at the first place it can succeed

but a regex can partly match in many places

>>> re.sub(r'ab+c', 'X', "abbabbbbbbbabbbc")
'abbabbbbbbbX'

and may need to backtrack, e.g.:

>>> re.sub(r'a.*bc', 'X', "abbabbbbbbbcabbb")
'Xabbb'

poorly designed regex engines can get very slow
    have been used for denial-of-service attacks
Python (PCRE) regex matching is NP-hard due to back-references



re.findall

re.findall returns a list of the matched strings, e.g.:

>>> re.findall(r'\d+', "-5==10zzz200_")
['5', '10', '200']

if the regex contains () only the captured text is returned

>>> re.findall(r'(\d)\d*', "-5==10zzz200_")
['5', '1', '2']

if the regex contains multiple () a list of tuples is returned

>>> re.findall(r'(\d)\d*(\d)', "-5==10zzz200_")
[('1', '0'), ('2', '0')]
>>> re.findall(r'([^,]*), (\S+)', "Hopper, Grace Brewster Murray")
[('Hopper', 'Grace')]
>>> re.findall(r'([A-Z])([aeiou])', "Hopper, Grace Brewster Murray")
[('H', 'o'), ('M', 'u')]


re.split

re.split splits a string where a regex matches

>>> re.split(r'\d+', "-5==10zzz200_")
['-', '==', 'zzz', '_']

like cut in Shell scripts - but more powerful

for example, you can’t do this with cut

>>> re.split(r'\s*,\s*', "abc,de, ghi ,jk , mn")
['abc', 'de', 'ghi', 'jk', 'mn']

see also the string join function

>>> a = re.split(r'\s*,\s*', "abc,de, ghi ,jk , mn")
>>> a
['abc', 'de', 'ghi', 'jk', 'mn']
>>> ':'.join(a)
'abc:de:ghi:jk:mn'


Example - printing the last number

# Print the last number (real or integer) on every line
# Note: regexp to match number: -?\d+\.?\d*
# Note: use of assignment operator :=
import re, sys
for line in sys.stdin:
    if m := re.search(r'(-?\d+\.?\d*)\D*$', line):
        print(m.group(1))
source code for print_last_number.py
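
To see why the regex above picks out the last number, it can help to wrap it in a function and try it on a few strings. This is a sketch for experimenting; last_number is a hypothetical name, not part of the course code.

```python
import re

def last_number(line):
    """Return the last number on line as a string, or None if there is none."""
    # \D*$ insists that only non-digits follow the captured number,
    # so any number earlier in the line cannot satisfy the regex
    m = re.search(r'(-?\d+\.?\d*)\D*$', line)
    return m.group(1) if m else None

print(last_number("width 3.5 height -12 cm\n"))   # -12
print(last_number("pi is 3.14"))                  # 3.14
print(last_number("no digits here"))              # None
```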



Example - finding numbers #0

# print the sum and mean of any positive integers found on stdin
# Note regexp to split on non-digits
# Note check to handle empty string from split
# Only positive integers handled
import re, sys
input_as_string = sys.stdin.read()
numbers = re.split(r"\D+", input_as_string)
total = 0
n = 0
for number in numbers:
    if number:
        total += int(number)
        n += 1
# test n rather than numbers: re.split always returns a non-empty list
if n:
    print(f"{n} numbers, total {total}, mean {total / n:.1f}")
source code for find_numbers.[Link]


Example - finding numbers #1

# print the sum and mean of any numbers found on stdin
# Note regexp to match number -?\d+\.?\d*
# match positive & negative integers & floating-point numbers
import re, sys
input_as_string = sys.stdin.read()
numbers = re.findall(r"-?\d+\.?\d*", input_as_string)
n = len(numbers)
total = sum(float(number) for number in numbers)
if numbers:
    print(f"{n} numbers, total {total}, mean {total / n:.1f}")
source code for find_numbers.[Link]


Example - counting enrollments with regexes & dicts


import re
# COURSE_CODES_FILE and ENROLLMENTS_FILE are pathnames set elsewhere in the full program
course_names = {}
with open(COURSE_CODES_FILE, encoding="utf-8") as f:
    for line in f:
        if m := re.match(r"(\S+)\s+(.*\S)", line):
            course_names[m.group(1)] = m.group(2)
enrollments_count = {}
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        # delete everything from the first | to the end of the line
        course_code = re.sub(r"\|.*\n", "", line)
        if course_code not in enrollments_count:
            enrollments_count[course_code] = 0
        enrollments_count[course_code] += 1
for (course_code, enrollment) in sorted(enrollments_count.items()):
    # if no name for course_code use ???
    name = course_names.get(course_code, "???")
    print(f"{enrollment:4} {course_code} {name}")
source code for count_enrollments.[Link]



Example - counting enrollments with split & counters
import collections
# COURSE_CODES_FILE and ENROLLMENTS_FILE are pathnames set elsewhere in the full program
course_names = {}
with open(COURSE_CODES_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, course_name = line.strip().split("\t", maxsplit=1)
        course_names[course_code] = course_name
enrollments_count = collections.Counter()
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code = line.split("|")[0]
        enrollments_count[course_code] += 1
for (course_code, enrollment) in sorted(enrollments_count.items()):
    # if no name for course_code use ???
    name = course_names.get(course_code, "???")
    print(f"{enrollment:4} {course_code} {name}")
source code for count_enrollments.[Link]


Example - counting first names


import collections, re
# ENROLLMENTS_FILE is a pathname set elsewhere in the full program
already_counted = set()
first_name_count = collections.Counter()
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        _, student_number, full_name = line.split("|")[0:3]
        # count each student once, even if enrolled in several courses
        if student_number in already_counted:
            continue
        already_counted.add(student_number)
        if m := re.match(r".*,\s+(\S+)", full_name):
            first_name = m.group(1)
            first_name_count[first_name] += 1
# put the count first in the tuples so sorting orders on count before name
count_name_tuples = [(c, f) for (f, c) in first_name_count.items()]
# print first names in decreasing order of popularity
for (count, first_name) in sorted(count_name_tuples, reverse=True):
    print(f"{count:4} {first_name}")
source code for count_first_names.py


Example - finding duplicate first names using dict of dicts


import re, sys
# ENROLLMENTS_FILE and REPORT_MORE_THAN_STUDENTS are set elsewhere in the full program
course_first_name_count = {}
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, _, full_name = line.split("|")[0:3]
        if m := re.match(r".*,\s+(\S+)", full_name):
            first_name = m.group(1)
        else:
            print("Warning could not parse line", line.strip(), file=sys.stderr)
            continue
        if course_code not in course_first_name_count:
            course_first_name_count[course_code] = {}
        if first_name not in course_first_name_count[course_code]:
            course_first_name_count[course_code][first_name] = 0
        course_first_name_count[course_code][first_name] += 1
for course in sorted(course_first_name_count.keys()):
    for (first_name, count) in course_first_name_count[course].items():
        if count > REPORT_MORE_THAN_STUDENTS:
            print(course, "has", count, "students named", first_name)
source code for duplicate_first_names.[Link]



Example - finding duplicate first names using split & defaultdict of counters
import collections
# ENROLLMENTS_FILE and REPORT_MORE_THAN_STUDENTS are set elsewhere in the full program
# missing courses get an empty Counter, missing names get a count of 0
course_first_name_count = collections.defaultdict(collections.Counter)
with open(ENROLLMENTS_FILE, encoding="utf-8") as f:
    for line in f:
        course_code, _, full_name = line.split("|")[0:3]
        given_names = full_name.split(",")[1].strip()
        first_name = given_names.split(" ")[0]
        course_first_name_count[course_code][first_name] += 1
for (course, name_counts) in sorted(course_first_name_count.items()):
    for (first_name, count) in name_counts.items():
        if count > REPORT_MORE_THAN_STUDENTS:
            print(course, "has", count, "students named", first_name)
source code for duplicate_first_names.[Link]


Example - Changing Filenames with Regex


# written by andrewt@[Link] for COMP(2041|9044)
#
# Change the names of the specified files
# by substituting occurrences of regex with replacement
# (simple version of the perl utility rename)
import os
import re
import sys
if len(sys.argv) < 3:
    print(f"Usage: {sys.argv[0]} <regex> <replacement> [files]", file=sys.stderr)
    sys.exit(1)
regex = sys.argv[1]
replacement = sys.argv[2]
for old_pathname in sys.argv[3:]:
    new_pathname = re.sub(regex, replacement, old_pathname, count=1)
    if new_pathname == old_pathname:
        continue
    # refuse to overwrite an existing file
    if os.path.exists(new_pathname):
        print(f"{sys.argv[0]}: '{new_pathname}' exists", file=sys.stderr)
        continue
    try:
        os.rename(old_pathname, new_pathname)
    except OSError as e:
        print(f"{sys.argv[0]}: '{new_pathname}' {e}", file=sys.stderr)
source code for rename_regex.py


Example - Changing Filenames with Regex & Eval


# written by andrewt@[Link] for COMP(2041|9044)
#
# Change the names of the specified files
# by substituting occurrences of regex with replacement
# (simple version of the perl utility rename)
#
# also demonstrating argument processing and use of eval
# beware eval can allow arbitrary code execution,
# it should not be used where security is important
import argparse
import os
import re
import sys
parser = argparse.ArgumentParser()
# add required arguments
parser.add_argument("regex", type=str, help="match against filenames")
parser.add_argument("replacement", type=str, help="replaces matches with this")
parser.add_argument("filenames", nargs="*", help="filenames to be changed")
# add some optional boolean arguments
parser.add_argument(
    "-d", "--dryrun", action="store_true", help="show changes but don't make them"
)
parser.add_argument(
    "-v", "--verbose", action="store_true", help="print more information"
)
parser.add_argument(
    "-e",
    "--eval",
    action="store_true",
    help="evaluate replacement as python expression, match available as _",
)
# optional integer argument which defaults to 1
parser.add_argument(
    "-n",
    "--replace_n_matches",
    type=int,
    default=1,
    help="replace n matches (0 for all matches)",
)
args = parser.parse_args()
def eval_replacement(match):
    """if --eval given, evaluate replacement string as Python
    with the variable _ set to the matching part of the filename
    """
    if not args.eval:
        return args.replacement
    _ = match.group(0)
    return str(eval(args.replacement))
for old_pathname in args.filenames:
    try:
        new_pathname = re.sub(
            args.regex, eval_replacement, old_pathname, count=args.replace_n_matches
        )
    except OSError as e:
        print(
            f"{sys.argv[0]}: '{old_pathname}': '{args.replacement}' {e}",
            file=sys.stderr,
        )
        continue
    if new_pathname == old_pathname:
        if args.verbose:
            print("no change:", old_pathname)
        continue
    if os.path.exists(new_pathname):
        print(f"{sys.argv[0]}: '{new_pathname}' exists", file=sys.stderr)
        continue
    if args.dryrun:
        print(old_pathname, "would be renamed to", new_pathname)
        continue
    if args.verbose:
        print("renaming", old_pathname, "to", new_pathname)
    try:
        os.rename(old_pathname, new_pathname)
    except OSError as e:
        print(f"{sys.argv[0]}: '{new_pathname}' {e}", file=sys.stderr)
source code for rename_regex_eval.py
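
The program above passes a function (eval_replacement) to re.sub instead of a replacement string. re.sub calls the function once per match with the match object and substitutes whatever string it returns. A minimal sketch of that mechanism without eval, where double_number and the filename are made up for the illustration:

```python
import re

def double_number(match):
    """Replacement function: receives a match object, returns the new text."""
    return str(int(match.group(0)) * 2)

every_number = re.sub(r"\d+", double_number, "file7_v12.txt")
first_number = re.sub(r"\d+", double_number, "file7_v12.txt", count=1)

print(every_number)   # file14_v24.txt
print(first_number)   # file14_v12.txt
```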



Example - When Harry Met Hermione #0

# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
import re, sys, os
for filename in sys.argv[1:]:
    tmp_filename = filename + ".new"
    if os.path.exists(tmp_filename):
        print(f"{sys.argv[0]}: {tmp_filename} already exists\n", file=sys.stderr)
        sys.exit(1)
    with open(filename) as f:
        with open(tmp_filename, "w") as g:
            for line in f:
                changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
                changed_line = changed_line.replace("Harry", "Hermione")
                changed_line = changed_line.replace("Zaphod", "Harry")
                g.write(changed_line)
    os.rename(tmp_filename, filename)
source code for change_names.[Link]


Example - When Harry Met Hermione #1

# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
import re, sys, os, shutil, tempfile
for filename in sys.argv[1:]:
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp:
        with open(filename) as f:
            for line in f:
                changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
                changed_line = changed_line.replace("Harry", "Hermione")
                changed_line = changed_line.replace("Zaphod", "Harry")
                tmp.write(changed_line)
    shutil.move(tmp.name, filename)
source code for change_names.[Link]


Example - When Harry Met Hermione #2

# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
# modified text is stored in a list then file over-written
import re, sys, os
for filename in sys.argv[1:]:
    changed_lines = []
    with open(filename) as f:
        for line in f:
            changed_line = re.sub(r"Herm[io]+ne", "Zaphod", line)
            changed_line = changed_line.replace("Harry", "Hermione")
            changed_line = changed_line.replace("Zaphod", "Harry")
            changed_lines.append(changed_line)
    with open(filename, "w") as g:
        g.write("".join(changed_lines))
source code for change_names.[Link]



Example - When Harry Met Hermione #3

# For each file given as argument replace occurrences of Hermione
# allowing for some misspellings with Harry and vice-versa.
# Relies on Zaphod not occurring in the text.
# modified text is stored in a single string then file over-written
import re, sys, os
for filename in sys.argv[1:]:
    with open(filename) as f:
        text = f.read()
    changed_text = re.sub(r"Herm[io]+ne", "Zaphod", text)
    changed_text = changed_text.replace("Harry", "Hermione")
    changed_text = changed_text.replace("Zaphod", "Harry")
    with open(filename, "w") as g:
        g.write(changed_text)
source code for change_names.[Link]
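
All four versions rely on the placeholder Zaphod not occurring in the text. One possible alternative, sketched here and not taken from the course code, is to swap both names in a single re.sub pass with a replacement function, so no placeholder is needed. The name swap_names is made up for the illustration.

```python
import re

def swap_names(text):
    """Swap Harry and Hermione (allowing some misspellings) in one pass."""
    def other_name(match):
        # decide the replacement from what was actually matched
        if match.group(0) == "Harry":
            return "Hermione"
        return "Harry"
    return re.sub(r"Harry|Herm[io]+ne", other_name, text)

print(swap_names("Harry met Hermoine, then Hermione met Zaphod"))
# Hermione met Harry, then Harry met Zaphod
```

Because each match is replaced exactly once, text already containing Zaphod is left untouched.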
