
Regular Expressions

Regular expressions are strings that define a rule (or "pattern") for how other strings may be constructed.

Patterns are a sequence of:

- normal characters, which are used as-is
- other characters that have special meanings

Special characters are used to define:

- any one character:
  - .
- one of a set of characters:
  - [abc] one of the characters a, b, or c
  - [^abc] any character except a, b, or c
  - [a-z] any character in the range a through z
- one pattern or another:
  - p|q pattern p or pattern q
- a sequence of patterns:
  - (pq...) pattern p, followed by pattern q, ...; the parentheses also group the sequence so that it can be repeated or extracted
- repetitions of a pattern:
  - p{n,m} pattern p repeated between n and m times, where:
    - n defaults to 0
    - m defaults to infinity
- the start or end of a string:
  - ^ or $ by themselves "anchor" the pattern to the start or end of the string
- remove the special meaning of a character:
  - \x the character x, where x is a special character such as . or *
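
The constructs listed above can be tried out directly in Python. The following is a minimal sketch: the sample patterns and strings are made up for illustration, and it uses re.fullmatch, which succeeds only if the whole string matches the pattern.

```python
import re

# Each entry: (pattern, string, whether the whole string should match).
# The sample strings are illustrative only.
cases = [
    (r'a.c',       'abc',    True),   # . matches any one character
    (r'[abc]',     'b',      True),   # one of a set of characters
    (r'[^abc]',    'b',      False),  # any character except a, b, or c
    (r'[a-z]',     'q',      True),   # a range of characters
    (r'cat|dog',   'dog',    True),   # one pattern or another
    (r'(ab){2,3}', 'ababab', True),   # a group repeated 2 to 3 times
    (r'(ab){2,3}', 'ab',     False),  # too few repetitions
    (r'a\.c',      'abc',    False),  # \. is a literal dot, not "any character"
    (r'a\.c',      'a.c',    True),
]

results = [re.fullmatch(p, s) is not None for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(f'{p!r:14} {s!r:10} expected={expected} got={got}')
```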

There are also abbreviations for often-used patterns:

- `*` means `{0,}`
- `+` means `{1,}`
- `?` means `{0,1}`
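- `\d` means `[0-9]` (for Python str patterns it also matches other Unicode decimal digits unless the re.ASCII flag is used)
- `\s` means "any whitespace character"
- `\w` means "any word character" (a letter, a digit, or an underscore)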
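
As a quick check of the abbreviations above, this sketch compares each short form with its longer spelling on a few arbitrary sample strings. The variable names are chosen for illustration only.

```python
import re

# Pairs of (abbreviation, longer form) taken from the list above.
pairs = [
    (r'a*', r'a{0,}'),
    (r'a+', r'a{1,}'),
    (r'a?', r'a{0,1}'),
]
samples = ['', 'a', 'aa', 'b']  # arbitrary test strings

# For every sample, both spellings should agree on whether the whole string matches.
agree = all(
    (re.fullmatch(short, s) is None) == (re.fullmatch(long, s) is None)
    for short, long in pairs
    for s in samples
)
print(agree)

# \w accepts letters, digits, and the underscore.
word_ok = re.fullmatch(r'\w+', 'part_7') is not None
print(word_ok)
```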

Note: when using a backslash (\) in an ordinary string literal, you must write it twice (\\) or prefix the string with r (see below).
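
A short sketch of the two ways to write a backslash in a pattern. The sample text 'room 42' is made up.

```python
import re

doubled = '\\d+'   # ordinary literal: \\ stands for one backslash
raw = r'\d+'       # raw literal: the backslash is kept as typed

print(doubled == raw)   # both hold the same three characters: \ d +
print(len(raw))
found = re.search(raw, 'room 42').group(0)
print(found)
```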

String Literal Prefixes

A string literal can be prefixed with:

- an r (for "raw") to ignore escape sequences starting with \; this is commonly used with regular expressions since they often need to include a literal backslash
- a b (for "byte") to create a byte string; this avoids the need to encode() the string into bytes
- an f (for "format") to embed sequences of the form {expression} in the string; each is replaced with the value of expression when the string literal is evaluated

The r prefix is often used with regular expression patterns, particularly when they contain backslash escape sequences.
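
To make the three prefixes concrete, here is a small sketch. The variable names and values are made up for illustration.

```python
name = 'regex'

raw = r'\n'      # two characters: a backslash and the letter n
plain = '\n'     # one character: a newline
data = b'abc'    # a bytes object, equal to 'abc'.encode()
text = f'{name} has {len(name)} characters'   # expressions are evaluated here

print(len(raw), len(plain))
print(data == 'abc'.encode())
print(text)
```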

Using Regular Expressions

Functions that use regular expressions are defined in the re module.

The function match(pattern,string) tests whether pattern matches at the start of string.

The function search(pattern,string) tests whether pattern matches anywhere in string.

If the pattern is found, these functions return a match object which contains information about the match. If the pattern is not found, they return None.

In [6]:

import re
print(re.match('[abc]',"a"))
print(re.match('([abcx])[abc]',"xa"))
print(re.search('[abc]',"xa"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='xa'>
<re.Match object; span=(1, 2), match='a'>
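
Because both functions return None when nothing is found, code that uses the result usually checks for None first. The following is one possible sketch; the helper first_vowel is a hypothetical name, not part of the re module.

```python
import re

def first_vowel(text):
    """Return the first vowel in text, or None if there is none."""
    m = re.search(r'[aeiou]', text)
    if m is None:        # no match: calling m.group() here would raise AttributeError
        return None
    return m.group(0)

print(first_vowel('xyz'))
print(first_vowel('python'))
print(re.match(r'[abc]', 'xa'))   # match only tries the start of the string
```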

Match Objects

The group(n) method of a match object returns the nth parenthesized part of the match: group(0) is the whole match, group(1) is the first parenthesized sub-pattern, and so on. Groups are numbered by the position of their opening parenthesis. For example, to extract a number that is surrounded by dashes:

In [1]:

import re
m = re.search(r'(-(\d{3,4}))(-)',"abc-1234-456")
print(m,m.group(0),m.group(1),m.group(2))

<re.Match object; span=(3, 9), match='-1234-'> -1234- -1234 1234
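
Match objects offer a few other methods besides group(n). This sketch reuses the pattern from the cell above to show groups(), span(n), start() and end().

```python
import re

m = re.search(r'(-(\d{3,4}))(-)', 'abc-1234-456')

all_groups = m.groups()   # every parenthesized group, as a tuple
digits_span = m.span(2)   # (start, end) indexes of group 2 in the string
print(all_groups)
print(digits_span)
print(m.start(), m.end()) # start and end of the whole match
```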

Examples

BCIT ID: A\d{8}

Canadian Postal Code: [A-Z]\d[A-Z] \d[A-Z]\d

North American phone number with optional area code: (\d{3}-)?\d{3}-\d{4}

An IPv4 address: (\d{1,3}\.){3}\d{1,3}

In [9]:

import re
(re.search( r'[A-Z]\d[A-Z] \d[A-Z]\d',"V3H 2A1"), 
 re.search( r'A\d{8}', 'A001234567'),
 re.search( r'(\d{3}-)?\d{3}-\d{4}', '432-8936'),
 re.search( r'(\d{1,3}\.){3}\d{1,3}', '142.232.230.10'))

Out[9]:

(<re.Match object; span=(0, 7), match='V3H 2A1'>,
 <re.Match object; span=(0, 9), match='A00123456'>,
 <re.Match object; span=(0, 8), match='432-8936'>,
 <re.Match object; span=(0, 14), match='142.232.230.10'>)
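
Note that search reports a match for 'A001234567' even though that string has nine digits, because the pattern only has to match somewhere in the string. To check that an entire string has the required form, one option is re.fullmatch, or the ^ and $ anchors. A minimal sketch, with the variable name bcit_id chosen for illustration:

```python
import re

bcit_id = r'A\d{8}'

loose = re.search(bcit_id, 'A001234567') is not None         # matches the first nine characters
strict = re.fullmatch(bcit_id, 'A001234567') is not None     # one digit too many
exact = re.fullmatch(bcit_id, 'A00123456') is not None       # exactly eight digits
anchored = re.search(r'^A\d{8}$', 'A001234567') is not None  # anchored at both ends

print(loose, strict, exact, anchored)
```

The same caution applies to the IPv4 pattern: \d{1,3} accepts any one to three digits, so a string such as 999.999.999.999 also matches.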

Set the variable pat to a regular expression that matches part numbers made up of the following parts, in order:

- a prefix of either JSM or CP
- a single digit between 1 and 4
- an optional letter S
- the letter A
- an optional letter H
- a dash
- the characters 12V or 24V