Regular expressions are strings that define a rule (or "pattern") for how other strings may be constructed.
Patterns are a sequence of ordinary characters, which match themselves, and special characters. Special characters are used to define:
- any character: `.`
- one of a set of characters:
  - `[abc]`: one of the characters a, b, or c
  - `[^abc]`: any character except a, b, or c
  - `[a-z]`: any character in the range a through z
- one pattern or another: `p|q`: pattern p or pattern q
- repetition: `p{n,m}`: pattern p repeated between n and m times
- where: `^` or `$` by themselves "anchor" the pattern to the start or end of a string
- a literal special character: `\x`: the character x

There are also abbreviations for often-used patterns:
- `*` means `{0,}`
- `+` means `{1,}`
- `?` means `{0,1}`
- `\d` means `[0-9]`
- `\s` means "any whitespace character"
- `\w` means "any word (alphanumeric) character"
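
The special characters and abbreviations listed above can be tried out one at a time. The following is a minimal sketch, not part of the original notes: the sample strings are made up for illustration, and it uses `re.fullmatch` from Python's standard `re` module, which succeeds only when the whole string matches the pattern.

```python
import re

# Each pair is (pattern, text). The sample texts are illustrative only.
checks = [
    (r".", "x"),          # any character
    (r"[abc]", "b"),      # one of a, b, or c
    (r"[^abc]", "d"),     # any character except a, b, or c
    (r"[a-z]", "q"),      # any character in the range a through z
    (r"cat|dog", "dog"),  # one pattern or another
    (r"a{2,3}", "aaa"),   # a repeated between 2 and 3 times
    (r"\.", "."),         # a literal dot
    (r"\d+", "2024"),     # + means {1,}: one or more digits
    (r"\w*", ""),         # * means {0,}: the empty string also matches
    (r"-?\d", "-7"),      # ? means {0,1}: optional minus sign, then a digit
    (r"\s", "\t"),        # a tab is a whitespace character
]

results = [re.fullmatch(p, t) is not None for p, t in checks]
print(all(results))                    # True
print(re.fullmatch(r"a{2,3}", "a"))    # None: only one a, at least 2 required
```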
Note: to include a backslash (`\`) in an ordinary string literal, you must write it twice (`\\`) or prefix the string with `r` (see below).
A string literal can be prefixed with:
- `r` (for "raw") to ignore escape sequences starting with `\`; this is commonly used with regular expressions since they often need to include a literal backslash
- `b` (for "byte") to create a byte string; this avoids the need to `encode()` the string into bytes
- `f` (for "format") to embed sequences of the form `{expression}` in the string; these are replaced with the value of the expression when the string literal is evaluated

The `r` prefix is often used with regular expression patterns, particularly when they contain backslash escape sequences.
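
A short sketch of the three prefixes in action may help. The variable names (`plain`, `raw`, `data`, `count`, `label`) are made up for this illustration and do not come from the original notes.

```python
# Without the r prefix, each backslash must be doubled.
plain = "\\d{3}"
raw = r"\d{3}"
print(plain == raw)            # True: both contain the same characters

# "\n" is one newline character; r"\n" is a backslash followed by n.
print(len("\n"), len(r"\n"))   # 1 2

# A byte string literal equals the encoded form of the same text.
data = b"abc"
print(data == "abc".encode())  # True

# An f-string replaces {expression} with the value of the expression.
count = 3
label = f"{count + 1} digits"
print(label)                   # 4 digits
```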
Functions that use regular expressions are defined in the `re` module.
The function `match(pattern, string)` tests whether `pattern` matches at the start of `string`.
The function `search(pattern, string)` tests whether `pattern` matches anywhere in `string`.
If the pattern is found, these functions return a match object, which contains information about the match. If the pattern is not found, they return `None`.
import re
print(re.match('[abc]',"a"))
print(re.match('([abcx])[abc]',"xa"))
print(re.search('[abc]',"xa"))
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='xa'>
<re.Match object; span=(1, 2), match='a'>
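
Because these functions return `None` when the pattern is not found, code that uses the result usually checks for `None` first. The sketch below assumes a hypothetical helper named `describe`, which is not part of the `re` module; it uses the match object's `group()` and `span()` methods.

```python
import re

def describe(pattern, text):
    """Hypothetical helper: report where pattern first matches in text."""
    m = re.search(pattern, text)
    if m is None:
        # Calling m.group() here would raise AttributeError.
        return "no match"
    return f"matched {m.group(0)!r} at {m.span()}"

print(describe(r"[abc]", "xa"))    # matched 'a' at (1, 2)
print(describe(r"[abc]", "xyz"))   # no match
print(re.match(r"[abc]", "xa"))    # None: match() only tests the start of the string
```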
The `group(n)` method of a match object returns the n-th parenthesized part of the match (`group(0)` is the whole match, `group(1)` is the first parenthesized sub-pattern, and so on). For example, to extract a number that is surrounded by dashes:
import re
m = re.search(r'(-(\d{3,4}))(-)',"abc-1234-456")
print(m,m.group(0),m.group(1),m.group(2))
<re.Match object; span=(3, 9), match='-1234-'> -1234- -1234 1234
- BCIT ID: `A\d{8}`
- Canadian Postal Code: `[A-Z]\d[A-Z] \d[A-Z]\d`
- North American phone number with optional area code: `(\d{3}-)?\d{3}-\d{4}`
- An IPv4 address: `(\d{1,3}\.){3}\d{1,3}`
import re
(re.search( r'[A-Z]\d[A-Z] \d[A-Z]\d',"V3H 2A1"),
re.search( r'A\d{8}', 'A001234567'),
re.search( r'(\d{3}-)?\d{3}-\d{4}', '432-8936'),
re.search( r'(\d{1,3}\.){3}\d{1,3}', '142.232.230.10'))
(<re.Match object; span=(0, 7), match='V3H 2A1'>, <re.Match object; span=(0, 9), match='A00123456'>, <re.Match object; span=(0, 8), match='432-8936'>, <re.Match object; span=(0, 14), match='142.232.230.10'>)
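
Note that in the output above, `search` accepted `'A001234567'` by matching only its first nine characters. To check that an entire string has the required form, the pattern can be anchored with `^` and `$`, or `re.fullmatch` can be used instead of `re.search`. A minimal sketch, where the name `id_pattern` is chosen only for this illustration:

```python
import re

id_pattern = r"A\d{8}"

# Unanchored: finds A followed by 8 digits inside the longer string.
print(re.search(id_pattern, "A001234567") is not None)               # True

# Anchored at both ends: the extra digit makes the match fail.
print(re.search("^" + id_pattern + "$", "A001234567") is not None)   # False

# re.fullmatch requires the whole string to match the pattern.
print(re.fullmatch(id_pattern, "A00123456") is not None)             # True
print(re.fullmatch(id_pattern, "A001234567") is not None)            # False
```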
Set the variable `pat` to a regular expression that matches part numbers created according to the following sequence:
- a prefix of either `JSM` or `CP`
- a single digit between 1 and 4
- an optional letter `S`
- the letter `A`
- an optional letter `H`
- a dash
- the characters `12V` or `24V`
import re
pat=r"(JSM|CP)[1-4]S{0,1}AH?-(12V|24V)"
re.search(pat,'JSM1SAH-12')
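
One way to check a candidate pattern is to run it against a few part numbers that should and should not be accepted. The sample part numbers below are invented for this sketch; `re.fullmatch` is used so that the whole string must follow the sequence.

```python
import re

pat = r"(JSM|CP)[1-4]S{0,1}AH?-(12V|24V)"

# Invented samples: the first two follow the sequence, the rest do not.
samples = [
    "JSM1SAH-12V",  # every optional part present
    "CP4A-24V",     # no optional S or H
    "JSM1SAH-12",   # missing the trailing V
    "CP5A-12V",     # digit outside the range 1 to 4
    "JSM2SH-12V",   # missing the required letter A
]

results = {s: re.fullmatch(pat, s) is not None for s in samples}
for s, ok in results.items():
    print(f"{s:12} {'valid' if ok else 'invalid'}")
```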