Python Regular Expression Pattern Classification

There is plenty of material on this topic online, but it is scattered and disorganized. So I sorted it into categories, both for my own reference and to save time for colleagues who come across this article.

Tip: The pattern argument of a regular expression usually contains backslashes. To keep Python from treating them as string escape sequences, it is best to write patterns as raw strings, r''. For example, r'\t' is the same string as '\\t' (a backslash followed by t), which the regex engine then interprets as a tab character.

The advantage of using raw strings is that you can write fewer backslashes, making the code easier to read.
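As a small illustration of the tip above, the sketch below compares raw and ordinary strings. The sample text 'cat concat' and the variable names are made up for this example.

```python
import re

# A raw string keeps the backslash; an ordinary string needs it doubled.
same_string = (r'\t' == '\\t')    # True: both are backslash + t
raw_length = len(r'\t')           # 2 characters
plain_length = len('\t')          # 1 character: a real tab

# Where it matters: in an ordinary string '\b' is a backspace character,
# so the regex engine never sees the word-boundary escape.
plain_hits = re.findall('\bcat\b', 'cat concat')   # looks for backspace + cat + backspace
raw_hits = re.findall(r'\bcat\b', 'cat concat')    # looks for the whole word "cat"

print(plain_hits)  # []
print(raw_hits)    # ['cat']
```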

Regular expressions have four main uses: matching (a yes/no test), capturing, replacing, and splitting.
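A minimal sketch of the four uses with the standard re module follows. The date string is invented sample data, and the choice of re.match, re.search, re.sub, and re.split is just one common way to cover each use.

```python
import re

text = "2024-01-15, 2023-12-31"   # hypothetical sample data

# 1. Matching: a match object is truthy, None is falsy
is_date = bool(re.match(r'\d{4}-\d{2}-\d{2}', text))

# 2. Capturing: parentheses pull out parts of the match
first = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
year, month, day = first.groups()

# 3. Replacing: swap every hyphen for a slash
slashed = re.sub(r'-', '/', text)

# 4. Splitting: cut on a comma followed by optional whitespace
parts = re.split(r',\s*', text)

print(is_date)           # True
print(year, month, day)  # 2024 01 15
print(slashed)           # 2024/01/15, 2023/12/31
print(parts)             # ['2024-01-15', '2023-12-31']
```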

Matching Single Characters

.: Matches any single character except a newline (with the re.DOTALL flag it matches a newline too). For example, the expression 't.o' matches too or two.
[...]: Matches any single character in the characters listed in the brackets.
- A single [] represents a single character: [amk] matches 'a', 'm', or 'k'. [a-zA-Z0-9] represents all letters + numbers. [A-Z][a-z] represents the first character as an uppercase letter and the second character as a lowercase letter, i.e., two characters.
- Note that there is no need to use , to separate the contents. [0-35-9] represents 0 to 3 and 5 to 9.
- You can use - to specify a range. For example, [abc] and [a-c] have the same meaning.
[^...]: Matches any single character not in the brackets.
- [^abc] matches any character except a, b, or c.
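The rules in this section can be checked with re.findall. This is only a quick sketch; the input strings are invented for the demonstration.

```python
import re

# '.' matches any character except a newline, so 't\no' is skipped
dot_hits = re.findall(r't.o', 'too two t\no tao')

# One [] stands for exactly one character
set_hits = re.findall(r'[amk]', 'make')

# Ranges sit side by side with no separator: 0-3 and 5-9
range_hits = re.findall(r'[0-35-9]', '0123456789')

# Two sets in a row match two characters: uppercase then lowercase
pair_hits = re.findall(r'[A-Z][a-z]', 'Hello World ok')

# '^' at the start of a set negates it
negated_hits = re.findall(r'[^abc]', 'abcdef')

print(dot_hits)      # ['too', 'two', 'tao']
print(set_hits)      # ['m', 'a', 'k']
print(range_hits)    # ['0', '1', '2', '3', '5', '6', '7', '8', '9']
print(pair_hits)     # ['He', 'Wo']
print(negated_hits)  # ['d', 'e', 'f']
```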

Matching Single Characters: Shorthand Classes That Can Also Be Written Inside []

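\w: Matches letters, digits, and the underscore. For str patterns Python matches in Unicode mode by default, so \w also matches Chinese characters and full-width digits; with the re.ASCII flag it is equivalent to [a-zA-Z0-9_].
\W: Matches any character that is not a letter, digit, or underscore. The opposite of \w; equivalent to [^a-zA-Z0-9_] with re.ASCII.
\s: Matches any whitespace character, such as a space, tab, or newline. Equivalent to [ \t\n\r\f\v] with re.ASCII.
\S: Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\v] with re.ASCII.
\d: Matches any digit. Equivalent to [0-9] with re.ASCII; by default it also matches other Unicode decimal digits, such as full-width ones.
\D: Matches any non-digit. Equivalent to [^0-9] with re.ASCII.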
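The following sketch shows the difference between the default Unicode behavior and re.ASCII, and a shorthand class used inside []. The sample string (with Chinese characters and full-width digits) is made up for the illustration.

```python
import re

# Hypothetical sample: ASCII word, Chinese characters, full-width digits, a tab
sample = "user_01 价格 ４２\tend"

words_unicode = re.findall(r'\w+', sample)            # default Unicode mode
words_ascii = re.findall(r'\w+', sample, re.ASCII)    # [a-zA-Z0-9_] only

digits_unicode = re.findall(r'\d+', sample)
digits_ascii = re.findall(r'\d+', sample, re.ASCII)

spaces = re.findall(r'\s', sample)

# Shorthand classes can be combined with other characters inside []
digits_or_underscore = re.findall(r'[\d_]+', 'a_1-b22')

print(words_unicode)         # ['user_01', '价格', '４２', 'end']
print(words_ascii)           # ['user_01', 'end']
print(digits_unicode)        # ['01', '４２']
print(digits_ascii)          # ['01']
print(spaces)                # [' ', ' ', '\t']
print(digits_or_underscore)  # ['_1', '22']
```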

Matching Multiple Characters

*: Matches the preceding character zero or more times. For example, the expression abc* matches ab and abccc.
+: Matches the preceding character one or more times. For example, the expression abc+ matches abc and abccc.
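?: Matches the preceding character zero or one time, for example colou?r matches color and colour. On its own it is greedy; placed after another quantifier (*?, +?, ??, {m,n}?) it makes that quantifier non-greedy.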
{m}: Matches the preceding character exactly m times. For example, the expression ab{2}c matches abbc.
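{m,}: Matches the preceding character at least m times. For example, the expression ab{2,}c matches abbc and abbbbc.
{,n}: Matches the preceding character at most n times, including zero times. For example, the expression ab{,2}c matches ac, abc, and abbc.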
{m,n}: Matches the preceding character at least m and at most n times. For example, the expression ab{1,3}c matches abc, abbc, and abbbc.
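The quantifiers above can be tried out as follows. This is a small sketch with invented inputs; the last two lines show the greedy versus non-greedy difference that a trailing ? makes.

```python
import re

star = re.findall(r'abc*', 'ab abccc')                       # c zero or more times
plus = re.findall(r'abc+', 'ab abc abccc')                   # c one or more times
exact = re.findall(r'ab{2}c', 'abc abbc abbbc')              # exactly two b's
bounded = re.findall(r'ab{1,3}c', 'ac abc abbc abbbc abbbbc')
optional = re.findall(r'colou?r', 'color colour')            # u zero or one time

# Greedy: .+ grabs as much as possible; non-greedy: .+? grabs as little as possible
greedy = re.findall(r'<.+>', '<a><b>')
lazy = re.findall(r'<.+?>', '<a><b>')

print(star)      # ['ab', 'abccc']
print(plus)      # ['abc', 'abccc']
print(exact)     # ['abbc']
print(bounded)   # ['abc', 'abbc', 'abbbc']
print(optional)  # ['color', 'colour']
print(greedy)    # ['<a><b>']
print(lazy)      # ['<a>', '<b>']
```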

Special Matches

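|: Matches either the left or the right expression. For example, to match the integers from 0 to 100, use the expression re.match(r'(?:[1-9]?\d|100)$', string). The group is needed so that $ applies to both alternatives; without it, the 100 branch would also accept strings such as 1000.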
(): Matches the expression inside the parentheses and treats it as a group (a single unit). It is often combined with |, for example (163|126|qq) when recognizing email domains.
\b: Matches a word boundary.
- \b is a zero-width assertion: it consumes no characters and only marks a boundary. At a \b position, one side must be a \w character and the other side must be either a \W character or the start or end of the string.
- It is used to match text at word boundaries. For example, \bhe matches "he" as a separate word but not the "he" inside "where".
- \bman\b matches only the standalone man in "I am a man, not a woman", whereas man\b matches the man in both "man" and "woman", because the end of "woman" is also a boundary. Be careful when using it.
\B: Matches a non-word boundary.
- The opposite of \b. It can be understood as follows: at a \B position, either both sides are \w characters, or neither side is (each is a \W character or the start or end of the string).
- For example, py=\B can match "py=" in "py==", but not "py=" in "py=1".
\n, \t, etc.: Matches a newline character, matches a tab character, etc.
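The sketch below exercises alternation, grouping, \b, and \B. The helper name in_0_to_100 and the address someone@qq.com are hypothetical, chosen only for this illustration; re.fullmatch is used here as an alternative to anchoring the pattern by hand.

```python
import re

def in_0_to_100(s):
    """Return True if s is an integer from 0 to 100 with no leading zeros."""
    # fullmatch requires the whole string to match, so no explicit $ is needed
    return re.fullmatch(r'[1-9]?\d|100', s) is not None

range_checks = [in_0_to_100(s) for s in ['0', '99', '100', '101', '1000', '007']]

# A group combined with |, capturing which mail domain was used
domain = re.fullmatch(r'\w+@(163|126|qq)\.com', 'someone@qq.com').group(1)

sentence = "I am a man, not a woman"
whole_word = re.findall(r'\bman\b', sentence)   # only the standalone word
word_end = re.findall(r'man\b', sentence)       # also the end of "woman"

non_boundary_hit = re.search(r'py=\B', 'py==') is not None   # '=' then '=': no boundary
non_boundary_miss = re.search(r'py=\B', 'py=1') is not None  # '=' then '1': a boundary

print(range_checks)       # [True, True, True, False, False, False]
print(domain)             # qq
print(whole_word)         # ['man']
print(word_end)           # ['man', 'man']
print(non_boundary_hit)   # True
print(non_boundary_miss)  # False
```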

Positional Matches

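^: Matches the beginning of a string. With the re.MULTILINE flag it also matches at the start of each line.
$: Matches the end of a string, or the position just before a newline at the end of the string. With the re.MULTILINE flag it also matches at the end of each line.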
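- Example: re.match(r'\w{4,20}@163\.com$', address), where address is the string to check. It succeeds when address is 4 to 20 word characters followed by @163.com.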
- Can be used in combination with groups.
\A: Matches the beginning of a string.
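\Z: Matches only at the very end of the string. Unlike $, it does not match before a trailing newline. (In Perl and some other engines \Z does allow a final newline; in Python it does not.)
\z: Matches the very end of the string in Perl-style engines. Python's re module accepts it only from Python 3.14 on, as a synonym for \Z; earlier versions raise an error, so use \Z instead.
\G: Matches the position where the previous match ended in Perl-style engines. Python's re module does not support it.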
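A short sketch of how the anchors behave in Python's re module, using an invented two-line string that ends with a newline:

```python
import re

text = "first line\nsecond line\n"   # hypothetical sample ending in a newline

start_default = re.findall(r'^\w+', text)                  # start of string only
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)  # start of every line
start_absolute = re.findall(r'\A\w+', text, re.MULTILINE)  # \A ignores MULTILINE

# $ tolerates one trailing newline; \Z does not
dollar_matches = re.search(r'line$', text) is not None
z_matches = re.search(r'line\Z', text) is not None
z_with_newline = re.search(r'line\n\Z', text) is not None

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(start_absolute)   # ['first']
print(dollar_matches)   # True
print(z_matches)        # False
print(z_with_newline)   # True
```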

Examples

[Pp]ython: Matches "Python" or "python".
rub[ye]: Matches "ruby" or "rube".
[aeiou]: Matches any one of the lowercase vowels in the brackets.
[0-9]: Matches any digit. Equivalent to [0123456789].
[a-z]: Matches any lowercase letter.
[A-Z]: Matches any uppercase letter.
[a-zA-Z0-9]: Matches any ASCII letter or digit.
[^aeiou]: Matches any character except the letters a, e, i, o, or u.
[^0-9]: Matches any character except a digit.

*: Asterisk. Matches the preceding character zero or more times. For example, 'pytho*n' matches pythn, python, pythoon, pythooooon, and so on. The other repetition operators work the same way: ?, +, and {m,n}, where {m,n} gives explicit control by matching from m to n times.

\b can be used flexibly as well. For example, to find the words in a given string that start with a lowercase letter:

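import re

ss = "i Am a gOod boy  baby!!"
# \b anchors both ends of a word; [a-z] requires a lowercase first letter
result = re.findall(r'\b[a-z][a-zA-Z]*\b', ss)
print(result)  # ['i', 'a', 'gOod', 'boy', 'baby']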