Python - Regular Expressions

Hello there, future Python wizards! Today, we're going to embark on an exciting journey into the world of Regular Expressions (regex) in Python. Don't worry if you've never heard of regex before - by the end of this tutorial, you'll be wielding this powerful tool like a pro!


What are Regular Expressions?

Before we dive in, let's understand what regular expressions are. Imagine you're a detective trying to find a specific pattern in a sea of text. Regular expressions are like your magnifying glass, helping you search for and manipulate strings based on patterns. Cool, right?

Raw Strings

In Python, when working with regex, we often use raw strings. These are prefixed with an 'r' and treat backslashes as literal characters. This is particularly useful in regex as backslashes are common.

# Normal string
print("Hello\nWorld")
# Raw string
print(r"Hello\nWorld")

In the first case, you'll see "Hello" and "World" on separate lines. In the second, you'll see "Hello\nWorld" as is. This becomes crucial when working with regex patterns.
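
To see why this matters for regex specifically, here is a small sketch. The sample text and variable names are made up for illustration. In a normal string, Python turns "\b" into a backspace character before the regex engine ever sees it; in a raw string, the engine receives \b and reads it as a word boundary.

```python
import re

text = "cat concatenate cat"

# Normal string: "\b" becomes a backspace character, so nothing matches
plain = re.findall("\bcat\b", text)

# Raw string: the regex engine sees \b and treats it as a word boundary
raw = re.findall(r"\bcat\b", text)

print(plain)  # []
print(raw)    # ['cat', 'cat']
```

Notice that the "cat" hidden inside "concatenate" is skipped, because it does not sit on a word boundary.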

Metacharacters

Metacharacters are the building blocks of regex. They have special meanings and help us define patterns. Let's look at some common ones:

Metacharacter    Meaning
.                Matches any character except newline
^                Matches the start of a string
$                Matches the end of a string
*                Matches 0 or more repetitions
+                Matches 1 or more repetitions
?                Matches 0 or 1 repetition
{}               Matches an explicitly specified number of repetitions
[]               Specifies a set of characters to match
\                Escapes special characters
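
Here is a quick tour of these metacharacters in action. This is only a sketch: the sample strings and variable names are invented for illustration, and re.findall() and re.search() are covered in more detail later in this tutorial.

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot cut")

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^Hello", "Hello, World!"))
ends = bool(re.search(r"World!$", "Hello, World!"))

# *, + and ? control how often the preceding character may repeat
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
optional = re.findall(r"ab?", "a ab abb")

# {} sets an exact count, [] lists allowed characters, \ escapes a metacharacter
three_digits = re.findall(r"\d{3}", "12 345 6789")
letters = re.findall(r"[xyz]", "lazy fox")
literal_dots = re.findall(r"\.", "a.b.c")

print(dot)            # ['cat', 'cot', 'cut']
print(starts, ends)   # True True
print(star)           # ['a', 'ab', 'abb']
print(plus)           # ['ab', 'abb']
print(optional)       # ['a', 'ab', 'ab']
print(three_digits)   # ['345', '678']
print(letters)        # ['z', 'y', 'x']
print(literal_dots)   # ['.', '.']
```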

The re.match() Function

The re.match() function attempts to match a pattern at the beginning of a string. If it finds a match, it returns a match object; otherwise, it returns None.

import re

result = re.match(r"Hello", "Hello, World!")
if result:
    print("Match found:", result.group())
else:
    print("No match")

This will print "Match found: Hello". The group() method returns the matched substring.

The re.search() Function

While re.match() looks for a match at the beginning of a string, re.search() scans the entire string for a match.

import re

result = re.search(r"World", "Hello, World!")
if result:
    print("Match found:", result.group())
else:
    print("No match")

This will print "Match found: World".

Matching Vs Searching

The main difference between match() and search() is that match() checks for a match only at the beginning of the string, while search() checks for a match anywhere in the string.
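
A side-by-side sketch makes the difference concrete. It reuses the same sample string as the examples above; the variable names are just for illustration.

```python
import re

text = "Hello, World!"

# match() only succeeds if the pattern matches starting at index 0
matched = re.match(r"World", text)

# search() scans forward through the string until it finds a match
searched = re.search(r"World", text)

print(matched)          # None
print(searched.span())  # (7, 12)
```

So if you want to know whether a pattern appears anywhere in a string, search() is usually the one to reach for.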

The re.findall() Function

The re.findall() function returns all non-overlapping matches of a pattern in a string as a list.

import re

text = "The rain in Spain falls mainly in the plain"
result = re.findall(r"ain", text)
print(result)

This will print ['ain', 'ain', 'ain', 'ain'], one entry each for "rain", "Spain", "mainly", and "plain".

The re.sub() Function

The re.sub() function replaces all occurrences of a pattern in a string with a replacement string.

import re

text = "The rain in Spain"
result = re.sub(r"a", "o", text)
print(result)

This will print "The roin in Spoin".

The re.compile() Function

The re.compile() function creates a regex object for reuse, which can be more efficient if you're using the same pattern multiple times.

import re

pattern = re.compile(r"\d+")
result1 = pattern.findall("There are 123 apples and 456 oranges")
result2 = pattern.findall("I have 789 bananas")

print(result1)
print(result2)

This will print ['123', '456'] and ['789'].

The re.finditer() Function

The re.finditer() function returns an iterator yielding match objects for all non-overlapping matches of a pattern in a string.

import re

text = "The rain in Spain"
for match in re.finditer(r"ain", text):
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")

This will print:

Found 'ain' at position 5-8
Found 'ain' at position 14-17

Use Cases of Python Regex

Regular expressions have numerous practical applications. Let's look at a common use case:

Finding words starting with vowels

import re

text = "An apple a day keeps the doctor away"
# \w* (zero or more) rather than \w+ so that one-letter words like "a" also match
vowel_words = re.findall(r'\b[aeiouAEIOU]\w*', text)
print(vowel_words)

This will print ['An', 'apple', 'a', 'away'].

Regular Expression Modifiers: Option Flags

Python's re module provides several option flags that modify how patterns are interpreted:

Flag                    Description
re.IGNORECASE (re.I)    Performs case-insensitive matching
re.MULTILINE (re.M)     Makes ^ match at the start of each line and $ at the end of each line
re.DOTALL (re.S)        Makes . match any character, including newline
re.VERBOSE (re.X)       Ignores unescaped whitespace and allows # comments inside the pattern, so complex patterns are easier to read
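
Flags are passed as an extra argument to functions such as re.findall(), re.search(), and re.compile(). The sketch below tries each flag from the table on a made-up two-line string; the text and variable names are assumptions for illustration only.

```python
import re

text = "python is fun\nPython is powerful"

# re.IGNORECASE: "python" matches both spellings
any_case = re.findall(r"python", text, re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line, not just the string
first_words = re.findall(r"^\w+", text, re.MULTILINE)

# re.DOTALL: . is allowed to match the newline between the two lines
across = re.search(r"fun.Python", text, re.DOTALL)
no_across = re.search(r"fun.Python", text)

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
pattern = re.compile(r"""
    (\w+)      # the subject
    \s+is\s+   # the word "is" surrounded by whitespace
    (\w+)      # the description
""", re.VERBOSE)
pairs = pattern.findall(text)

print(any_case)     # ['python', 'Python']
print(first_words)  # ['python', 'Python']
print(bool(across), bool(no_across))  # True False
print(pairs)        # [('python', 'fun'), ('Python', 'powerful')]
```

To use several flags at once, combine them with |, for example re.IGNORECASE | re.MULTILINE.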

Regular Expression Patterns

Let's explore some more advanced patterns:

Character classes

Character classes allow you to specify a set of characters to match:

import re

text = "The quick brown fox jumps over the lazy dog"
result = re.findall(r"[aeiou]", text)
print(result)

This will print every lowercase vowel in the text, in order of appearance: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o'].

Special Character Classes

Python regex supports special character classes:

Class    Description
\d       Matches any decimal digit
\D       Matches any non-digit character
\s       Matches any whitespace character
\S       Matches any non-whitespace character
\w       Matches any word character (letters, digits, and the underscore)
\W       Matches any non-word character
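
Here is a short sketch of these classes at work. The sample string is invented for illustration. Pay attention to "ship_to": the underscore counts as a word character, so \w+ keeps it in one piece.

```python
import re

text = "Order 66: ship_to=Room 101"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of word characters
chunks = re.findall(r"\S+", text)     # runs of non-whitespace characters
separators = re.findall(r"\W", text)  # single non-word characters

print(digits)      # ['66', '101']
print(words)       # ['Order', '66', 'ship_to', 'Room', '101']
print(chunks)      # ['Order', '66:', 'ship_to=Room', '101']
print(separators)  # [' ', ':', ' ', '=', ' ']
```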

Repetition Cases

We can specify how many times a pattern should occur:

import re

text = "I have 111 apples and 22 oranges"
result = re.findall(r"\d{2,3}", text)
print(result)

This will print ['111', '22'], matching numbers with 2 or 3 digits.

Nongreedy repetition

By default, repetition is greedy, meaning it matches as much as possible. Adding a ? after the repetition makes it non-greedy:

import re

text = "<h1>Title</h1><p>Paragraph</p>"
greedy = re.findall(r"<.*>", text)
non_greedy = re.findall(r"<.*?>", text)
print("Greedy:", greedy)
print("Non-greedy:", non_greedy)

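The greedy pattern grabs everything from the first < to the last >, so it prints ['<h1>Title</h1><p>Paragraph</p>']. The non-greedy pattern stops at the first > it can, so it prints ['<h1>', '</h1>', '<p>', '</p>'].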

Grouping with Parentheses

Parentheses allow you to group parts of the regex:

import re

# placeholder address on the reserved example.com domain
text = "John Smith (john@example.com)"
result = re.search(r"(\w+) (\w+) \((\w+@\w+\.\w+)\)", text)
if result:
    print(f"Full Name: {result.group(1)} {result.group(2)}")
    print(f"Email: {result.group(3)}")

This extracts the name and email from the text.

Backreferences

Backreferences allow you to refer to previously matched groups:

import re

text = "<h1>Title</h1><p>Paragraph</p>"
result = re.findall(r"<(\w+)>.*?</\1>", text)
print(result)

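Here \1 must repeat whatever group 1 captured, so an opening tag only matches together with its own closing tag. Because the pattern contains a capturing group, re.findall() returns just the captured tag names, so this prints ['h1', 'p'].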
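
If you want the whole element rather than just the tag name, one option is re.finditer(), which gives you match objects. The sketch below reuses the same text and pattern; the mismatched string at the end is a made-up example showing that the backreference really does enforce matching tags.

```python
import re

text = "<h1>Title</h1><p>Paragraph</p>"
pattern = r"<(\w+)>.*?</\1>"

# group(0) is the full match; group(1) is the captured tag name
elements = [m.group(0) for m in re.finditer(pattern, text)]
tag_names = [m.group(1) for m in re.finditer(pattern, text)]

# The closing tag must repeat the opening tag, so this pair does not match
mismatch = re.search(pattern, "<h1>Title</p>")

print(elements)   # ['<h1>Title</h1>', '<p>Paragraph</p>']
print(tag_names)  # ['h1', 'p']
print(mismatch)   # None
```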

Alternatives

The | character allows you to specify alternatives:

import re

text = "The color of the sky is blue or gray"
result = re.search(r"blue|gray", text)
if result:
    print(f"Found color: {result.group()}")

This matches either "blue" or "gray". Because re.search() stops at the first match it finds, this prints "Found color: blue".

Anchors

Anchors specify positions in the text:

import re

text = "Python is awesome"
start = re.match(r"^Python", text)
end = re.search(r"awesome$", text)
print(f"Starts with Python: {bool(start)}")
print(f"Ends with awesome: {bool(end)}")

This checks if the text starts with "Python" and ends with "awesome".

Special Syntax with Parentheses

Parentheses can be used for more than just grouping:

  • (?:...) creates a non-capturing group
  • (?P<name>...) creates a named group
  • (?=...) creates a positive lookahead
  • (?!...) creates a negative lookahead

import re

text = "Python version 3.9.5"
result = re.search(r"Python (?:version )?(?P<version>\d+\.\d+\.\d+)", text)
if result:
    print(f"Version: {result.group('version')}")

This extracts the version number, whether or not "version" is present in the text.
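
The lookaheads from the list above deserve a quick sketch of their own. A lookahead checks what comes next without including it in the match. The price string and variable names here are invented for illustration.

```python
import re

text = "price: 100USD, 200EUR, 300USD"

# Positive lookahead: digits that are followed by "USD" ("USD" is not part of the match)
usd = re.findall(r"\d+(?=USD)", text)

# Negative lookahead: digits NOT followed by "USD".
# The extra \d alternative stops the engine from backtracking to a
# partial number such as "10" (which is followed by "0", not "USD").
not_usd = re.findall(r"\d+(?!\d|USD)", text)

print(usd)      # ['100', '300']
print(not_usd)  # ['200']
```

Without the \d guard, the negative lookahead would happily return ['10', '200', '30'], which is a common surprise when you first try lookaheads.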

And there you have it, folks! We've journeyed through the land of Python regex, from the basics to some pretty advanced concepts. Remember, like any powerful tool, regex takes practice to master. So don't be discouraged if it feels overwhelming at first. Keep experimenting, and soon you'll be finding patterns like a pro detective! Happy coding!
