5.6. Quantifiers

A quantifier metacharacter immediately follows a portion of a <regex> and indicates how many times that portion must occur for the match to succeed.

*

Matches zero or more repetitions of the preceding regex.

The quantified regex matches whether the preceding regex occurs once, many times, or not at all.

For example, [0-9]* matches zero or more digit characters. That means it would match an empty string, '1', '241', '48290', and so on.

> Note that when the pattern is just [0-9]*, with no other character before it, searching a string that does not start with a digit still succeeds: the match is an empty string at the very beginning of the string.
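The empty-match behavior described above can be checked with a short sketch. The two sample strings are hypothetical, chosen only for illustration, and re.search() is assumed as the matching function.

```python
import re

# Hypothetical sample strings, not taken from the original text.
starts_with_digits = "123abc"
starts_with_letters = "abc123"

# [0-9]* may match zero digits, so the search always succeeds at position 0.
m_digits = re.search(r"[0-9]*", starts_with_digits)
m_letters = re.search(r"[0-9]*", starts_with_letters)

print(repr(m_digits.group()))   # '123'
print(repr(m_letters.group()))  # '' (an empty match at the start of the string)
print(m_letters.span())         # (0, 0)
```

The second search never reaches '123': an empty match at position 0 already satisfies the pattern, so the scan stops there.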

+

Matches one or more repetitions of the preceding regex.

This is similar to *, but the quantified regex must occur at least once.

> Note that [0-9]+, unlike [0-9]*, never returns an empty string: it needs at least one digit to match, and fails if there is none.
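A minimal sketch contrasting the two quantifiers, again using re.search() and hypothetical sample strings:

```python
import re

sample = "abc123"  # hypothetical sample string

m_star = re.search(r"[0-9]*", sample)   # zero digits is enough to match
m_plus = re.search(r"[0-9]+", sample)   # needs at least one digit
m_none = re.search(r"[0-9]+", "abc")    # no digits anywhere in the string

print(repr(m_star.group()))  # ''
print(repr(m_plus.group()))  # '123'
print(m_none)                # None
```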

*? +?

The non-greedy (or lazy) versions of the * and + quantifiers.

When used alone, + and * are both greedy, meaning they produce the longest possible match. If you want the shortest possible match instead, use the non-greedy *? or +?.

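For instance, in a string such as '<one> <two>', the lazy pattern <.*?> ends its match at the first '>' character following 'one', whereas the greedy <.*> runs on to the last '>' in the string.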
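The difference between greedy and lazy matching can be sketched as follows. The sample string is an assumption made for illustration; only the reference to 'one' comes from the text above.

```python
import re

sample = "<one> <two> <three>"  # hypothetical sample string

greedy = re.search(r"<.*>", sample).group()      # longest possible match
lazy_star = re.search(r"<.*?>", sample).group()  # shortest possible match
lazy_plus = re.search(r"<.+?>", sample).group()  # shortest, at least one char

print(greedy)     # <one> <two> <three>
print(lazy_star)  # <one>
print(lazy_plus)  # <one>
```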

{m}

Matches exactly m repetitions of the preceding regex.

This is similar to * or +, but it specifies exactly how many times the preceding regex must occur for a match to succeed.

> Here, x\d{3}x matches 'x', followed by exactly three digits, followed by another 'x'. The match fails when there are fewer or more than three digits between the 'x' characters.
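A short sketch of the x\d{3}x behavior described in the note. The test strings are hypothetical.

```python
import re

pattern = r"x\d{3}x"

exact = re.search(pattern, "x123x")      # exactly three digits
too_few = re.search(pattern, "x12x")     # only two digits
too_many = re.search(pattern, "x1234x")  # four digits

print(exact.group())  # x123x
print(too_few)        # None
print(too_many)       # None
```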

{m,n}

Matches any number of repetitions of the preceding regex from m to n, inclusive.

This is similar to * or +, but it lets you set both a minimum and a maximum number of repetitions for the preceding regex.

The non-greedy version is {m,n}?.

> The quantified <regex> is -{2,4}. The match succeeds when there are two, three, or four dashes between the 'x' characters but fails otherwise. Omitting m implies a lower bound of 0, and omitting n implies an unlimited upper bound.
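The bounds described above can be checked with a sketch like the one below. The pattern x-{2,4}x is assumed from the note, and the test strings are hypothetical.

```python
import re

# Try one to five dashes between the 'x' characters.
results = {}
for count in range(1, 6):
    candidate = "x" + "-" * count + "x"
    results[count] = re.search(r"x-{2,4}x", candidate) is not None
print(results)  # {1: False, 2: True, 3: True, 4: True, 5: False}

# Omitting m gives a lower bound of 0; omitting n leaves the upper bound open.
no_lower = re.search(r"x-{,2}x", "xx")
no_upper = re.search(r"x-{3,}x", "x------x")
print(no_lower.group())  # xx
print(no_upper.group())  # x------x

# The non-greedy form takes as few repetitions as the bounds allow.
greedy_range = re.search(r"a{2,4}", "aaaa").group()
lazy_range = re.search(r"a{2,4}?", "aaaa").group()
print(greedy_range)  # aaaa
print(lazy_range)    # aa
```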

(<regex>)

Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses.

> bar+ matches 'bar', 'barr', 'barrr', and so on.
> (bar)+ matches 'bar', 'barbar', 'barbarbar', and so on.
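A brief sketch of how grouping changes what the + quantifier applies to. It uses re.fullmatch() so that the whole string must match, which is a choice made for this illustration.

```python
import re

# Without parentheses, + applies only to the final 'r'.
r_repeated = re.fullmatch(r"bar+", "barrr")
r_rejects = re.fullmatch(r"bar+", "barbar")

# With parentheses, + applies to the whole group 'bar'.
group_repeated = re.fullmatch(r"(bar)+", "barbarbar")
group_rejects = re.fullmatch(r"(bar)+", "barrr")

print(r_repeated is not None)      # True
print(r_rejects is not None)       # False
print(group_repeated is not None)  # True
print(group_rejects is not None)   # False
```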

Exercise 5.6

Read the contents of python.txt into a variable named text. Then do the following:

  1. Find all the capitalized words
    Find all the words that start with a capital letter, other than Python and CPython. Then print the result and the length of that list.
    Hint: You can use an if statement and .group().

  2. Find all the words that start with 'r' and end with 'd' or 'e'
    Find all words that start with the letter 'r' and end with the letter 'd' or 'e'. Print the result and the length of that list.

  3. Find all the words with a dash (-)
    Find all compound words that contain a single - (such as high-level or non-profit). Print the result and the length of that list.

  4. Find all the words with ten or more letters.
    Find all words with ten or more letters. Print the result and the length of that list.