Regular Expression — Python-Part III

Diane Khambu
6 min read · Feb 6, 2023
Photo by Leif Linding on Unsplash

This is the final post in my regex series, following part I and part II. We’ll go over grouping, lookahead assertions, and the split and search-and-replace functions in regular expressions. Let’s buckle up and dive in!

Grouping

Groups are marked by the ( and ) metacharacters. They have much the same meaning as in mathematical expressions: they group together the expression contained inside them. You can repeat the contents of a group with a quantifier such as * , + , ? , or {m,n} .

>>> import re
>>> p = re.compile('(po){1}')
>>> p.search('hippopotamus')
<re.Match object; span=(3, 5), match='po'>
>>> p = re.compile('(po){1,2}')
>>> p.search('hippopotamus')
<re.Match object; span=(3, 7), match='popo'>

Here we captured po using parentheses (). We also used the quantifier {m,n} to indicate how many repetitions of the group we want.
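To see why the parentheses matter, here is a small sketch comparing a quantifier applied to a single character with one applied to a group. The variable names are just for illustration.

```python
import re

# Without a group, + applies only to the preceding character 'o'.
no_group = re.search(r'po+', 'hippopotamus')

# With a group, + repeats the whole 'po' sequence.
with_group = re.search(r'(po)+', 'hippopotamus')

print(no_group.group())     # po
print(with_group.group())   # popo
# A repeated group keeps only the text of its last repetition.
print(with_group.group(1))  # po
```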

Groups are numbered starting with 0. Group 0 is always present; it is the text matched by the whole regular expression. Subgroups are numbered from left to right, starting at 1, by counting their opening parentheses.

>>> import re
>>> p = re.compile('i(p)p((o)po)')
>>> p.search('hippopotamus')
<re.Match object; span=(1, 7), match='ippopo'>
>>> m = p.search('hippopotamus')
>>> m.group(0)
'ippopo'
>>> m.group(1)
'p'
>>> m.group(2)
'opo'
>>> m.group(3)
'o'

Here we also captured subgroups of the match. We can use groups() to retrieve all of the subgroups at once.

>>> m.groups()
('p', 'opo', 'o')
>>> m.group(0, 3)
('ippopo', 'o')

groups() returns a tuple of all subgroups, from group 1 upward. We can also pass several group numbers to group() to get a tuple containing the corresponding values.
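Besides the matched text, a match object can also tell us where each group matched. The sketch below reuses the pattern from above and prints the position of every group with span(); the loop and variable names are my own additions.

```python
import re

m = re.search(r'i(p)p((o)po)', 'hippopotamus')

# span(n) reports where group n matched in the original string.
for n in range(len(m.groups()) + 1):
    print(n, m.group(n), m.span(n))

# The tuple from groups() can be unpacked directly into variables.
first, second, third = m.groups()
```

Nested groups overlap: group 3 sits inside group 2, so both start at the same index.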

Backreferences in a pattern let us specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 succeeds only if the exact text matched by group 1 appears again at that position.

>>> import re
>>> p = re.compile(r'\b(\w+),\s\1\b') # e.g. word,<space>word
>>> p.search('The closest living relatives of the hippopotamids are cetaceans (whales, dolphins, dolphins, porpoises, etc.)')
<re.Match object; span=(73, 91), match='dolphins, dolphins'>

We matched dolphins, dolphins since the repeated word satisfies the backreference \1.
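A common use of backreferences is spotting accidentally doubled words. Here is a minimal sketch with a made-up sentence; note that findall() returns the contents of group 1 rather than the whole match.

```python
import re

# A word, some whitespace, then the very same word again.
p = re.compile(r'\b(\w+)\s+\1\b')

text = 'The hippo is is a large large mammal'
doubled = p.findall(text)
print(doubled)  # ['is', 'large']
```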

Non-capturing and Named groups

Python supports several of Perl’s extensions and adds an extension syntax of its own on top of Perl’s. Extensions begin with (? ; if the first character after the question mark is P, the extension is specific to Python.

The syntax for a non-capturing group is (?:...) , where ... is any regular expression. As the name suggests, it groups the expression without capturing what it matches.

>>> import re
>>> m = re.match("([mai])+", "maitake") # regular
>>> m.groups()
('i',)
>>> m = re.match("(?:[mai])+", "maitake") # non-capturing
>>> m.groups()
()

A non-capturing group is useful when you want a group to denote part of a regular expression but are not interested in retrieving the group’s contents. It is particularly handy when modifying an existing pattern, because adding one does not change the numbering of the other groups. There is no performance difference in searching between capturing and non-capturing groups.
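Here is a small sketch of that numbering point. The patterns and the sample string are invented for illustration: we start with one capturing group and then add an alternation, once as a capturing group and once as a non-capturing group.

```python
import re

text = 'hippo-42'

# Original pattern: group 1 is the number.
original = re.compile(r'[a-z]+-(\d+)')

# Adding a capturing group for the alternation shifts the number to group 2.
shifted = re.compile(r'(hippo|rhino)-(\d+)')

# A non-capturing group leaves the numbering untouched.
stable = re.compile(r'(?:hippo|rhino)-(\d+)')

print(original.match(text).group(1))  # 42
print(shifted.match(text).group(1))   # hippo
print(stable.match(text).group(1))    # 42
```

Any code that already reads group(1) keeps working with the non-capturing version.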

The syntax for a named group is one of the Python-specific extensions: (?P<name>...) , where name is the name of the group. Named groups behave exactly like capturing groups and additionally associate a name with the group. Named groups are still given numbers.

>>> import re
>>> p = re.compile(r'(?P<height>\b\d+\.\d+\b)') # \. matches a literal dot
>>> m = p.search('Hippos measure 2.90 to 5.05 m (9.5 to 16.6 ft) long.')
>>> m.group('height')
'2.90'
>>> m.group(1)
'2.90'

We can also use named groups with an iterator:

>>> import re
>>> p = re.compile(r'(?P<height>\b\d+\.\d+\b)')
>>> it = p.finditer('Hippos measure 2.90 to 5.05 m (9.5 to 16.6 ft) long.')
>>> for match in it:
...     print(match.group('height'))
...
2.90
5.05
9.5
16.6

We can retrieve named groups as a dictionary using groupdict() .

>>> import re
>>> p = re.compile(r'(?P<name>\w+) (?P<kingdom>\w+)')
>>> m = p.match('hippopotamus animalia')
>>> m.groupdict()
{'name': 'hippopotamus', 'kingdom': 'animalia'}
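When a named group is optional and does not take part in the match, groupdict() reports it as None unless we pass a default. The sketch below makes the kingdom part optional, which is my own variation on the pattern above.

```python
import re

# The kingdom part is optional here, so it may not take part in the match.
p = re.compile(r'(?P<name>\w+)(?: (?P<kingdom>\w+))?')

full = p.match('hippopotamus animalia').groupdict()
partial = p.match('hippopotamus').groupdict()
filled = p.match('hippopotamus').groupdict(default='unknown')

print(full)
print(partial)
print(filled)
```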

We can also use named backreferences. Previously we used the (...)\1 syntax; here we refer to the group by name with (?P=name) .

>>> import re
>>> p = re.compile(r'\b(?P<animal>\w+),\s(?P=animal)\b')
>>> p.search('The closest living relatives of the hippopotamids are cetaceans (whales, dolphins, dolphins, porpoises, etc.')
<re.Match object; span=(73, 91), match='dolphins, dolphins'>

We matched the doubled dolphins!
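Named backreferences are also handy when the closing delimiter has to be the same as the opening one. Below is a rough sketch that finds a quoted phrase, where the group name quote and the sample sentences are my own.

```python
import re

# The closing quote must be the same character as the opening quote.
p = re.compile(r'''(?P<quote>["']).*?(?P=quote)''')

m = p.search("""He said "it's a hippo" today""")
print(m.group())         # the quoted phrase, including the quotes
print(m.group('quote'))  # the quote character that opened the phrase

# Mismatched quotes do not satisfy the backreference.
unbalanced = p.search('''say "hippo' now''')
print(unbalanced)        # None
```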

Lookahead Assertions

A lookahead is another kind of zero-width assertion. Zero-width assertions don’t cause the engine to advance through the string; they consume no characters at all and simply succeed or fail.

There are two types of lookahead assertions:

  • Positive lookahead assertion (?=...)
  • Negative lookahead assertion (?!...)

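A positive lookahead assertion succeeds if the contained regular expression matches at the current location, and fails otherwise. A negative lookahead assertion is the opposite: it succeeds if the contained expression does not match at the current location.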

>>> import re
>>> p = re.compile(r'[a-z]*[.](?=[a-z])') # positive lookahead, pass
>>> m = p.search("hippo.exe")
>>> m
<re.Match object; span=(0, 6), match='hippo.'>

>>> p = re.compile(r'[a-z]*[.](?=[0-9])') # positive lookahead, fail
>>> m = p.search("hippo.exe")
>>> m # no match is present

>>> p = re.compile(r'[a-z]*[.](?![0-9])') # negative lookahead, pass
>>> m = p.search("hippo.exe")
>>> m
<re.Match object; span=(0, 6), match='hippo.'>

>>> p = re.compile(r'[a-z]*[.](?![a-z])') # negative lookahead, fail
>>> m = p.search("hippo.exe")
>>> m # no match present

Notice that the lookahead portion is not part of the match object: in both the positive and the negative example that succeed, the match ends at the dot.
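A practical use of negative lookahead is excluding one specific case. The following sketch keeps file names whose extension is not exactly bat; the list of file names is made up for the example.

```python
import re

# Match file names whose extension is not exactly "bat".
# (?!bat$) fails only when "bat" followed by the end of string comes next.
p = re.compile(r'.*[.](?!bat$)[^.]*$')

files = ['hippo.exe', 'hippo.bat', 'hippo.batch', 'notes.txt']
kept = [name for name in files if p.match(name)]
print(kept)  # ['hippo.exe', 'hippo.batch', 'notes.txt']
```

The $ inside the lookahead is what lets hippo.batch through: it starts with bat but does not end there.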

Modifying Strings

The split() method splits a string wherever the pattern matches and returns the pieces as a list. Its signature is split(string, maxsplit=0) . If maxsplit is non-zero, at most maxsplit splits are performed and the remainder of the string is returned as the final element of the list.

>>> import re
>>> p = re.compile(r'\W+') # pattern without a capturing group
>>> m = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.')
>>> m
['After', 'elephants', 'and', 'rhinos', 'the', 'hippopotamus', 'is', 'the', 'next', 'largest', 'land', 'mammal', '']

>>> m1 = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.', maxsplit=5) # using maxsplit
>>> m1
['After', 'elephants', 'and', 'rhinos', 'the', 'hippopotamus is the next largest land mammal.']

If you also want to know the delimiters, put the pattern in capturing parentheses (); the delimiters are then returned as part of the list.

>>> p = re.compile(r'(\W+)') # pattern with a capturing group
>>> m = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.')
>>> m
['After', ' ', 'elephants', ' ', 'and', ' ', 'rhinos', ', ', 'the', ' ', 'hippopotamus', ' ', 'is', ' ', 'the', ' ', 'next', ' ', 'largest', ' ', 'land', ' ', 'mammal', '.', '']
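Because a capturing split keeps every delimiter, joining the pieces gives back the original string. Here is a short sketch with a shortened sentence of my own, using the module-level re.split() and slicing to separate words from delimiters.

```python
import re

text = 'After elephants and rhinos, the hippopotamus is next.'
parts = re.split(r'(\W+)', text)

# Nothing is lost: the pieces rebuild the original string.
rebuilt = ''.join(parts)

# Words sit at even indexes, delimiters at odd indexes.
words = parts[::2]
delimiters = parts[1::2]

print(rebuilt == text)
print(words)
print(delimiters)
```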

Search and Replace

The sub() method finds all matches for a pattern and replaces them with a different string.

The method signature for sub() is sub(replacement, string, count=0) . If count is non-zero, at most count replacements are made. The subn() method does the same work as sub() but returns a 2-tuple containing the new string and the number of replacements that were performed.

>>> p = re.compile(r'\b[a-z]+[^i][s]\b')
>>> m = p.findall('Hippos inhabits rivers, lakes, and mangrove swamps.')
>>> m
['inhabits', 'rivers', 'lakes', 'swamps']

>>> p = re.compile(r'\b[a-z]+[s]\b')
>>> p.sub('place', 'Hippos inhabits rivers, lakes, and mangrove swamps.') # sub
'Hippos place place, place, and mangrove place.'

>>> p.subn('place', 'Hippos inhabits rivers, lakes, and mangrove swamps.') # subn
('Hippos place place, place, and mangrove place.', 4)
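The count argument limits how many replacements are made. Here is a minimal sketch using the same pattern, replacing only the first two matches.

```python
import re

p = re.compile(r'\b[a-z]+s\b')
text = 'Hippos inhabits rivers, lakes, and mangrove swamps.'

# Only the first two matches are replaced.
first_two = p.sub('place', text, count=2)

# subn() reports how many replacements were actually made.
new_text, n = p.subn('place', text, count=2)

print(first_two)
print(n)
```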

Now let’s look at an example where the replacement argument is a function.

>>> def triple_exclaim(match):
...     value = match.group()
...     return value.replace('!', '!!!')
...

>>> p = re.compile(r'\w+!')
>>> p.sub(triple_exclaim, 'Male hippos appear to continue growing throughout their lives! Female reach maximum weight at around age 25!')
'Male hippos appear to continue growing throughout their lives!!! Female reach maximum weight at around age 25!!!'

The replacement function is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a match object, which we can use to compute the replacement string that it returns.

We can also use the module-level re.sub() function, which takes the pattern as its first argument, the replacement as its second, and the string as its third.

>>> re.sub(r'\w+!', triple_exclaim, 'Male hippos appear to continue growing throughout their lives! Female reach maximum weight at around age 25!')
'Male hippos appear to continue growing throughout their lives!!! Female reach maximum weight at around age 25!!!'
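A replacement string can also refer back to groups from the pattern, with \1 for numbered groups or \g<name> for named groups. This goes a little beyond the examples above, so treat it as a sketch: it collapses the doubled dolphins from the backreference section into one, on a shortened string of my own.

```python
import re

p = re.compile(r'\b(?P<animal>\w+),\s(?P=animal)\b')
text = 'whales, dolphins, dolphins, porpoises'

# \g<animal> inserts the text captured by the named group.
by_name = p.sub(r'\g<animal>', text)

# The same result with a replacement function.
by_function = p.sub(lambda m: m.group('animal'), text)

print(by_name)  # whales, dolphins, porpoises
```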

Some directions

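  • When possible, consider the methods of the str object first. For fixed strings, methods such as replace() or split() are usually simpler and faster than a regular expression.
  • The match() method matches the RE only at the beginning of a string, while search() scans forward through the string for a match.
  • When a greedy quantifier matches too much, use a non-greedy quantifier such as *? , +? , ?? , or {m,n}? . These match as little text as possible.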
>>> p = re.compile(r'<.*>') # greedy
>>> m = p.match("<html><head></head><body></body></html>")
>>> m.span()
(0, 39)
>>> m.group()
'<html><head></head><body></body></html>'

>>> p = re.compile(r'<.*?>') # non-greedy
>>> m = p.match("<html><head></head><body></body></html>")
>>> m.span(), m.group()
((0, 6), '<html>')
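The difference becomes even clearer with findall(). The sketch below collects every tag from the same string and also shows a negated character class, a common alternative that I am adding for comparison.

```python
import re

html = '<html><head></head><body></body></html>'

# Greedy: one match that swallows the whole string.
greedy = re.findall(r'<.*>', html)

# Non-greedy: each tag is matched separately.
lazy = re.findall(r'<.*?>', html)

# Alternative: match anything except '>' between the brackets.
negated = re.findall(r'<[^>]*>', html)

print(greedy)
print(lazy)
print(negated)
```

For real HTML, a dedicated parser is usually the safer choice; this is only meant to illustrate greediness.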

To summarize:

  • Groups are marked by () . There are also non-capturing groups and named groups. We can retrieve named groups as a dictionary.
  • Backreferences allow us to require that the text of a previously captured group appears again.
  • Lookahead assertions are zero-width assertions and can be positive lookahead or negative lookahead.
  • For search and replace, we have the sub() and subn() methods.

That’s it! Congratulations on coming this far! 🎈🙌

Hope you were able to get a firmer grip on regex!

See you in my next article. ✨

