Regular expressions

Here is more information about Python regular expressions:

- The Python re module: http://docs.python.org/library/re.html
- Regular expression HOWTO: http://docs.python.org/howto/regex.html

Use raw strings!

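Always write patterns as Python raw strings: r"…", r'…', r"""…""", or r'''…'''. In a raw string Python does not interpret backslashes itself, so sequences such as \b and \d reach the regexp engine unchanged.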
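The difference can be seen in a minimal sketch like the following. The sample strings and the variable names normal and raw are made up for illustration.

```python
import re

# In a normal string, "\b" is a single backspace character.
# In a raw string it stays as backslash + "b", which re reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal), len(raw))            # 5 7

# The normal string looks for literal backspace characters and finds none.
print(re.search(normal, "a cat"))       # None

# The raw string looks for the word "cat" and finds it.
print(re.search(raw, "a cat").group())  # cat
```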

Regexp special characters

- r"." matches any character except a newline
- r"^…" / r"…$" matches the start / end of the string
- r"…?" / r"…*" / r"…+" matches a regexp at most once / any number of times / at least once
- r"…??" / r"…*?" / r"…+?" are the non-greedy alternatives
- r"…|…|…" matches either this or that or that
- r"[…]" matches any of the characters given
  - r"[^…]" matches any character not given
  - r"[a-z]" matches any character between "a" and "z" (according to Unicode code point order)
  - r"[…-]" matches a minus (in addition to the other given characters)
  - r"[]…]" matches a closing bracket (in addition to the other given characters)
- backslashed characters
  - r"\." / r"\[" / r"\\" / etc. matches the literal symbol
  - r"\s" matches a whitespace character
    - r"\S" matches non-whitespace
  - r"\w" matches a letter or a digit or underscore
    - r"\W" matches non-(letter|digit|underscore)
  - r"\d" matches a digit
    - r"\D" matches non-digit
  - r"\b" matches a word boundary
    - i.e., the empty string, but only between r"\W" and r"\w" (in either order), or at the start / end of the string next to r"\w"
- r"[^\W\d_]" matches only letters
  - try to understand why
  - the idea is taken from http://stackoverflow.com/questions/1673749
- r"…(…)…" matches the whole regexp, but captures the part inside the parentheses so that you can look it up later
  - r"…(?:…)…" matches the regexp, but does not capture the part inside the parentheses
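A few of the special characters above in action, as a small sketch. The sample strings and variable names are invented for this example, and the expected results are given in the comments.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy r".+" takes as much as possible; non-greedy r".+?" as little as possible.
greedy = re.findall(r"<.+>", text)   # ['<b>bold</b> and <i>italic</i>']
lazy = re.findall(r"<.+?>", text)    # ['<b>', '</b>', '<i>', '</i>']

# r"u?" makes the "u" optional; r"|" separates alternatives.
optional = re.findall(r"colou?r", "color colour")          # ['color', 'colour']
either = re.findall(r"cat|dog|bird", "dog, cat, fish")     # ['dog', 'cat']

# Character classes: a range, a negated range, and a set holding "]" and "-".
in_range = re.findall(r"[a-c]", "abcd-]")       # ['a', 'b', 'c']
not_in_range = re.findall(r"[^a-c]", "abcd-]")  # ['d', '-', ']']
specials = re.findall(r"[]-]", "abcd-]")        # ['-', ']']

# r"\b" only matches at word boundaries, so "concat" and "cats" are skipped.
words = re.findall(r"\bcat\b", "cat concat cats cat.")  # ['cat', 'cat']

# r"[^\W\d_]" is "a word character that is neither a digit nor an underscore".
letters = re.findall(r"[^\W\d_]+", "abc_123 déf")       # ['abc', 'déf']

# Only the first pair of parentheses captures; (?:…) just groups.
captured = re.search(r"(\d+)-(?:\d+)", "10-20").groups()  # ('10',)
```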

Regexp functions

These are the main regexp functions:

- re.compile(r"…", [flags])
  - returns a compiled pattern object with all the methods below (but with no pattern argument, since the pattern is already compiled)
- re.split(r"…", string, [maxsplit])
  - returns a list of strings, split by the pattern
    - if there are capturing parentheses r"…(…)…" in the pattern, then their values are also returned as part of the list
- re.findall(r"…", string, [flags])
  - returns a list of all substrings that match the pattern
    - if there is exactly one pair of capturing parentheses r"…(…)…" in the pattern, then it returns a list of the captured strings instead
    - if there are n > 1 capturing parentheses in the pattern, then it returns a list of n-tuples
- re.sub(r"…", replacement, string, [count])
  - substitutes each occurrence of the pattern with the replacement string
  - if there are n capturing parentheses r"…(…)…" in the pattern, then you can use \1, …, \n as backreferences in the replacement string
- re.search(r"…", string, [flags])
  - returns a match object for the first match, or None if there is no match
  - read more about match objects here: http://docs.python.org/library/re.html#match-objects
- re.finditer(r"…", string, [flags])
  - returns an iterator of match objects, which can be used in a for loop
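The functions listed above can be tried out on one sample string, as in this sketch. The string line and the pattern name pair are assumptions made up for the example.

```python
import re

line = "alpha=1, beta=22, gamma=333"

# Compile once; the pattern object has the same methods without the pattern argument.
pair = re.compile(r"(\w+)=(\d+)")

# split: a capturing group in the separator pattern is kept in the result.
parts = re.split(r",\s*", line)         # ['alpha=1', 'beta=22', 'gamma=333']
parts_sep = re.split(r"(,)\s*", line)   # ['alpha=1', ',', 'beta=22', ',', 'gamma=333']

# findall: plain strings without groups, tuples with two groups.
numbers = re.findall(r"\d+", line)      # ['1', '22', '333']
pairs = pair.findall(line)              # [('alpha', '1'), ('beta', '22'), ('gamma', '333')]

# sub: \1 and \2 refer to the first and second capture group.
swapped = pair.sub(r"\2=\1", line)      # '1=alpha, 22=beta, 333=gamma'

# search: a match object for the first match, or None.
first = pair.search(line)
missing = pair.search("no pairs here")  # None

# finditer: one match object per match, usable in a for loop.
names = [m.group(1) for m in pair.finditer(line)]  # ['alpha', 'beta', 'gamma']
```

Because search returns None when nothing matches, check the result before calling methods on it.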

Match objects

Match objects are returned by re.search and re.finditer. They have the following methods:

- .group() => returns the matching string
  - m.group() == m.group(0)
- .group(k) => returns the kth capture group (k=0 means the whole match)
- .groups() => returns a tuple of all capture groups
- .start() => returns the start position of the match in the search string
  - .start(k) => returns the start position of the kth capture group
- .end() => returns the end position of the match in the search string
  - .end(k) => returns the end position of the kth capture group
- .span() => returns a tuple (start, end)
  - m.span(k) == (m.start(k), m.end(k))
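The match object methods above, applied to one made-up search string as a short sketch (the name m follows the notation used in the list):

```python
import re

m = re.search(r"(\d+)-(\d+)", "pages 10-25 only")

whole = m.group()      # '10-25', same as m.group(0)
first = m.group(1)     # '10'
second = m.group(2)    # '25'
both = m.groups()      # ('10', '25')

# Positions are indexes into the search string; end is exclusive.
start, end = m.start(), m.end()   # 6, 11
whole_span = m.span()             # (6, 11)
second_span = m.span(2)           # (9, 11)
```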
