Regular Expression — Python-Part III

Photo by Leif Linding on Unsplash

Grouping

>>> import re
>>> p = re.compile('(po){1}')
>>> p.search('hippopotamus')
<re.Match object; span=(3, 5), match='po'>
>>> p = re.compile('(po){1,2}')
>>> p.search('hippopotamus')
<re.Match object; span=(3, 7), match='popo'>
>>> import re
>>> p = re.compile('i(p)p((o)po)')
>>> p.search('hippopotamus')
<re.Match object; span=(1, 7), match='ippopo'>
>>> m = p.search('hippopotamus')
>>> m.group(0)
'ippopo'
>>> m.group(1)
'p'
>>> m.group(2)
'opo'
>>> m.group(3)
'o'
>>> m.groups()
('p', 'opo', 'o')
>>> m.group(0, 3)
('ippopo', 'o')
>>> import re
>>> p = re.compile(r'\b(\w+),\s\1\b') # e.g. "word, word": a word, a comma, whitespace, then the same word again
>>> p.search('The closest living relatives of the hippopotamids are cetaceans (whales, dolphins, dolphins, porpoises, etc.)')
<re.Match object; span=(73, 91), match='dolphins, dolphins'>

Non-capturing and Named groups

>>> import re
>>> m = re.match("([mai])+", "maitake") # regular
>>> m.groups()
('i',)
>>> m = re.match("(?:[mai])+", "maitake") # non-capturing
>>> m.groups()
()
>>> import re
>>> p = re.compile(r'(?P<height>\b\d+\.\d+\b)') # \. matches a literal decimal point
>>> m = p.search('Hippos measure 2.90 to 5.05 m (9.5 to 16.6 ft) long.')
>>> m.group('height')
'2.90'
>>> m.group(1)
'2.90'
>>> import re
>>> p = re.compile(r'(?P<height>\b\d+\.\d+\b)') # \. matches a literal decimal point
>>> it = p.finditer('Hippos measure 2.90 to 5.05 m (9.5 to 16.6 ft) long.')
>>> for match in it:
...     print(match.group('height'))
...
2.90
5.05
9.5
16.6
>>> import re
>>> p = re.compile(r'(?P<name>\w+) (?P<kingdom>\w+)')
>>> m = p.match('hippopotamus animalia')
>>> m.groupdict()
{'name': 'hippopotamus', 'kingdom': 'animalia'}
>>> import re
>>> p = re.compile(r'\b(?P<animal>\w+),\s(?P=animal)\b')
>>> p.search('The closest living relatives of the hippopotamids are cetaceans (whales, dolphins, dolphins, porpoises, etc.)')
<re.Match object; span=(73, 91), match='dolphins, dolphins'>

Lookahead Assertions

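  • Positive lookahead assertion (?=...): succeeds only if the pattern inside matches at the current position, without consuming any text.
  • Negative lookahead assertion (?!...): succeeds only if the pattern inside does not match at the current position.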
>>> import re
>>> p = re.compile(r'[a-z]*[.](?=[a-z])') # positive lookahead, pass
>>> m = p.search("hippo.exe")
>>> m
<re.Match object; span=(0, 6), match='hippo.'>

>>> p = re.compile(r'[a-z]*[.](?=[0-9])') # positive lookahead, fail
>>> m = p.search("hippo.exe")
>>> m # no match is present

>>> p = re.compile(r'[a-z]*[.](?![0-9])') # negative lookahead, pass
>>> m = p.search("hippo.exe")
>>> m
<re.Match object; span=(0, 6), match='hippo.'>

>>> p = re.compile(r'[a-z]*[.](?![a-z])') # negative lookahead, fail
>>> m = p.search("hippo.exe")
>>> m # no match present
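
As a rough sketch of where a negative lookahead helps in practice, the snippet below rejects file names that end in one particular extension. The pattern, the helper name is_allowed, and the sample file names are illustrative assumptions and not part of the examples above.

```python
import re

# Hypothetical pattern: anything, a dot, then an extension that is not exactly "bat".
# (?!bat$) is a negative lookahead: it fails when "bat" plus end-of-string comes next.
NOT_BAT = re.compile(r'.*[.](?!bat$)[^.]*$')

def is_allowed(filename):
    """Return True if filename has an extension other than 'bat'."""
    return NOT_BAT.match(filename) is not None

for name in ['hippo.exe', 'hippo.bat', 'hippo.batch', 'hippo']:
    print(name, is_allowed(name))
# hippo.exe True
# hippo.bat False
# hippo.batch True
# hippo False
```

Because the lookahead consumes no text, the extension is still matched by [^.]* afterwards; the $ inside the lookahead is what lets 'hippo.batch' through.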

Modifying Strings

>>> import re
>>> p = re.compile(r'\W+') # no capturing group: separators are dropped
>>> m = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.')
>>> m
['After', 'elephants', 'and', 'rhinos', 'the', 'hippopotamus', 'is', 'the', 'next', 'largest', 'land', 'mammal', '']

>>> m1 = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.', maxsplit=5) # using maxsplit
>>> m1
['After', 'elephants', 'and', 'rhinos', 'the', 'hippopotamus is the next largest land mammal.']
>>> p = re.compile(r'(\W+)') # capturing group: separators are kept in the result
>>> m = p.split('After elephants and rhinos, the hippopotamus is the next largest land mammal.')
>>> m
['After', ' ', 'elephants', ' ', 'and', ' ', 'rhinos', ', ', 'the', ' ', 'hippopotamus', ' ', 'is', ' ', 'the', ' ', 'next', ' ', 'largest', ' ', 'land', ' ', 'mammal', '.', '']

Search and Replace

>>> p = re.compile(r'\b[a-z]+[^i][s]\b')
>>> m = p.findall('The hippo inhabits rivers, lakes, and mangrove swamps.')
>>> m
['inhabits', 'rivers', 'lakes', 'swamps']

>>> p = re.compile(r'\b[a-z]+[s]\b')
>>> p.sub('place', 'The hippo inhabits rivers, lakes, and mangrove swamps.') # sub returns the new string
'The hippo place place, place, and mangrove place.'

>>> p.subn('place', 'The hippo inhabits rivers, lakes, and mangrove swamps.') # subn also returns the number of replacements
('The hippo place place, place, and mangrove place.', 4)

>>> def triple_exclaim(match):
...     value = match.group()
...     return value.replace('!', '!!!')
...

>>> p = re.compile(r'\w+!')
>>> p.sub(triple_exclaim, 'Male hippos appear to continue growing throughout their lives! Females reach maximum weight at around age 25!')
'Male hippos appear to continue growing throughout their lives!!! Females reach maximum weight at around age 25!!!'
>>> re.sub(r'\w+!', triple_exclaim, 'Male hippos appear to continue growing throughout their lives! Females reach maximum weight at around age 25!') # module-level equivalent
'Male hippos appear to continue growing throughout their lives!!! Females reach maximum weight at around age 25!!!'

Some directions

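  • When a fixed string is all you need to find or replace, consider the methods of the str object (for example replace() or split()) before reaching for a regex.
  • match() matches the RE only at the beginning of the string, while search() scans forward through the string for the first position where it matches.
  • Quantifiers are greedy by default and match as much text as possible. When that consumes too much, use the non-greedy forms *? , +? , ?? or {m,n}? , which match as little text as possible.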
>>> p = re.compile(r'<.*>') # greedy
>>> m = p.match("<html><head></head><body></body></html>")
>>> m.span()
(0, 39)
>>> m.group()
'<html><head></head><body></body></html>'

>>> p = re.compile(r'<.*?>') # non-greedy
>>> m = p.match("<html><head></head><body></body></html>")
>>> m.span(), m.group()
((0, 6), '<html>')
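
The first two directions can be sketched in the same way. This is only an illustration: the sample string and the variable names below are assumptions chosen for the demo.

```python
import re

text = 'hippopotamus amphibius'  # hypothetical sample string

# A fixed substring needs no regex: str methods cover it.
replaced = text.replace('hippo', 'HIPPO')
print(replaced)                  # HIPPOpotamus amphibius
print(text.startswith('hippo'))  # True
print('pot' in text)             # True

# match() only tries at position 0; search() scans forward through the string.
p = re.compile(r'pot')
print(p.match(text))             # None
found = p.search(text)
print(found.span())              # (5, 8)
```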

To summarize:

  • Groups are marked by () . They can be capturing, non-capturing (?:...) or named (?P<name>...) . We can also retrieve named groups as a dictionary with groupdict().
  • Backreferences such as \1 or (?P=name) let us match again the text captured by an earlier group.
  • Lookahead assertions are zero-width assertions and can be positive lookahead or negative lookahead.
  • For search and replace, we have sub() and subn() methods.
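
As a compact recap, the sketch below combines named groups, groupdict() and subn(). It reuses the name/kingdom pattern from earlier; the \g<name> syntax in the replacement string is standard in Python's re module but was not shown above, and the variable names are assumptions for the demo.

```python
import re

# Named groups, as in the earlier name/kingdom example.
p = re.compile(r'(?P<name>\w+) (?P<kingdom>\w+)')

m = p.match('hippopotamus animalia')
print(m.groupdict())
# {'name': 'hippopotamus', 'kingdom': 'animalia'}

# In a replacement string, \g<name> refers back to a named group.
swapped, count = p.subn(r'\g<kingdom>: \g<name>', 'hippopotamus animalia')
print(swapped, count)
# animalia: hippopotamus 1
```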
