Regex: The Syntax of Regular Expressions

Regular Expression (Regex) Syntax (EN)

A Regular Expression (or Regex) is a pattern (or filter) that describes the set of strings that match it. In other words, a regex accepts a certain set of strings and rejects the rest.

A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). A regex is constructed by combining many smaller sub-expressions.

1 Matching a Single Character

The fundamental building blocks of a regex are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, the regex x matches the substring "x"; z matches "z"; and 9 matches "9".

Non-alphanumeric characters without special meaning in regex also match themselves. For example, = matches "="; @ matches "@".

2 Regex Special Characters and Escape Sequences

Regex's Special Characters

These characters have special meaning in regex (each is discussed in detail in later sections):

- metacharacter: dot (.)
- bracket list: [ ]
- position anchors: ^, $
- occurrence indicators: +, *, ?, { }
- parentheses: ( )
- or: |
- escape and metacharacter: backslash (\)

Escape Sequences

The characters listed above have special meanings in regex. To match one of these characters literally, prefix it with a backslash (\); the result is known as an escape sequence. For example, \+ matches "+"; \[ matches "["; and \. matches ".".

Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit Unicode code point. The exact set supported varies between regex engines.
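The effect of escaping can be checked with a minimal sketch in Python's re module (the choice of Python and the sample strings are assumptions made for illustration; other engines behave similarly for these cases):

```python
import re

# Unescaped, + is a repetition operator; escaped, it matches a literal "+".
plus_literal = re.fullmatch(r"1\+1", "1+1") is not None    # True
plus_operator = re.fullmatch(r"1+1", "1+1") is not None    # False: means "one or more 1s, then 1"

# Unescaped, . matches any character; escaped, it matches only ".".
dot_literal_ok = re.fullmatch(r"a\.b", "a.b") is not None  # True
dot_literal_no = re.fullmatch(r"a\.b", "axb") is not None  # False

# re.escape() inserts the backslashes needed to match a string literally.
escaped = re.escape("[x+y]")
escaped_ok = re.fullmatch(escaped, "[x+y]") is not None    # True
```

Using re.escape() is usually safer than escaping by hand when the literal text comes from user input.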

3 Matching a Sequence of Characters (String or Text)

Sub-Expressions

A regex is constructed by combining many smaller sub-expressions or atoms. For example, the regex Friday matches the string "Friday". Matching is case-sensitive by default, but can be set to case-insensitive via a modifier.

4 OR (|) Operator

You can provide alternatives using the "OR" operator, denoted by a vertical bar '|'. For example, the regex four|for|floor|4 accepts strings "four", "for", "floor" or "4".
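As a small illustration, the alternation above can be tried in Python (an assumption of this sketch; the candidate list is made up for the example). fullmatch() is used so that the whole string must be accepted:

```python
import re

pattern = re.compile(r"four|for|floor|4")

candidates = ["four", "for", "floor", "4", "five"]
# Keep only the strings that the regex accepts in full.
accepted = [s for s in candidates if pattern.fullmatch(s)]
# accepted == ['four', 'for', 'floor', '4']
```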

5 Bracket List (Character Class) [...], [^...], [.-.]

A bracket expression is a list of characters enclosed by [ ], also called character class. It matches ANY ONE character in the list. However, if the first character of the list is the caret (^), then it matches ANY ONE character NOT in the list. For example, the regex [02468] matches a single digit 0, 2, 4, 6, or 8; the regex [^02468] matches any single character other than 0, 2, 4, 6, or 8.

Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].

Most of the special regex characters lose their meaning inside a bracket list and can be used as they are; the exceptions are ^, -, ] and \.

- To include a ], place it first in the list, or use escape \].
- To include a ^, place it anywhere but first, or use escape \^.
- To include a - place it last, or use escape \-.
- To include a \, use escape \\.
- No escape is needed for the other characters, such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
- You can also include metacharacters (to be explained in the next section), such as \w, \W, \d, \D, \s, \S inside the bracket list.
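The bracket-list rules above can be sketched in Python as follows (Python and the sample inputs are assumptions for illustration):

```python
import re

# A list and its negation.
even = re.findall(r"[02468]", "2019-08-14")       # ['2', '0', '0', '8', '4']
not_even = re.findall(r"[^02468]", "2019-08-14")  # ['1', '9', '-', '-', '1']

# A range is shorthand for listing every character in between.
a_to_d = re.findall(r"[a-d]", "abcdef")           # ['a', 'b', 'c', 'd']

# Most special characters are literal inside the brackets.
specials = re.findall(r"[.+*?]", "a.b+c*d?")      # ['.', '+', '*', '?']

# ] placed first, ^ placed not-first, and - placed last are all literal.
tricky = re.findall(r"[]^-]", "a]b^c-d")          # [']', '^', '-']
```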

Named Character Classes in Bracket List (POSIX; Not Supported by All Engines)

Named (POSIX) classes of characters are pre-defined within bracket expressions. They are:

- [:alnum:], [:alpha:], [:digit:]: letters+digits, letters, digits.
- [:xdigit:]: hexadecimal digits.
- [:lower:], [:upper:]: lowercase/uppercase letters.
- [:cntrl:]: Control characters
- [:graph:]: printable characters, except space.
- [:print:]: printable characters, include space.
- [:punct:]: printable characters, excluding letters and digits.
- [:space:]: whitespace

For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)

6 Metacharacters ., \w, \W, \d, \D, \s, \S

A metacharacter is a symbol with a special meaning inside a regex.

- The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, ... matches any 3 characters (including letters, digits and whitespace, but not newline); the.. matches "there", "these", "the" followed by two spaces, and so on.
- \w (word character) matches any single letter, digit or underscore (same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word-character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_]).
- In regex, the uppercase metacharacter is always the inverse of the lowercase counterpart.
- \d (digit) matches any single digit (same as [0-9]). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
- \s (space) matches any single whitespace character (same as [ \t\n\r\f]: blank, tab, newline, carriage-return and form-feed). The uppercase counterpart \S (non-space) matches any single character that is not matched by \s (same as [^ \t\n\r\f]).

Examples:

    \s\s       # Two whitespace characters
    \S\S\s     # Two non-whitespace characters followed by a whitespace character
    \s+        # One or more whitespace characters
    \S+\s\S+   # Two words (runs of non-whitespace) separated by a single whitespace character
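A short Python sketch of these metacharacters (the language and the sample text are assumptions chosen for illustration):

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d+", text)   # ['42', '7']
words = re.findall(r"\w+", text)    # ['Room', '42', 'Floor', '7']

# the.. needs exactly two more characters after "the".
five_chars = re.fullmatch(r"the..", "there") is not None          # True

# \S+\s\S+ allows exactly one whitespace character between the words.
one_space = re.fullmatch(r"\S+\s\S+", "hello world") is not None   # True
two_spaces = re.fullmatch(r"\S+\s\S+", "hello  world") is not None # False
```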

7 Backslash (\) and Regex Escape Sequences

Regex uses backslash (\) for three purposes:

1. for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).
2. to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.
3. for common escape sequences such as \n for newline and \t for tab.

Take note that in many programming languages (C, Java, Python), backslash (\) is also used for escape sequences in string literals, e.g., "\n" for newline, "\t" for tab, and "\\" for a single \. Consequently, to write the regex pattern \\ (which matches one \) in these languages, you need to write "\\\\" (two levels of escaping). Similarly, you need to write "\\d" for the regex metacharacter \d. This is cumbersome and error-prone.
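The two levels of escaping can be seen in a minimal Python sketch. Python also offers raw string literals (r"..."), which leave backslashes untouched and so avoid the first level; the variable names and the sample path here are made up for illustration:

```python
import re

path = "C:\\temp"            # the actual string is C:\temp (one backslash)

# Ordinary literal: four backslashes in source become the 2-character regex \\
normal_pattern = "\\\\"
# Raw literal: what you type is what the regex engine sees.
raw_pattern = r"\\"

same_pattern = normal_pattern == raw_pattern              # True
pattern_length = len(raw_pattern)                         # 2
found = re.search(raw_pattern, path) is not None          # True

# The same applies to metacharacters: "\\d" and r"\d" are the same regex.
same_digits = re.findall("\\d", "a1b2") == re.findall(r"\d", "a1b2")  # True
```

Writing every pattern as a raw string is a common convention in Python for exactly this reason.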

8 Occurrence Indicators (Repetition Operators): +, *, ?, {m}, {m,n}, {m,}

A regex sub-expression may be followed by an occurrence indicator (aka repetition operator):

- ?: The preceding item is optional, i.e., matched 0 or 1 times.
- *: The preceding item will be matched zero or more times, i.e., 0+
- +: The preceding item will be matched one or more times, i.e., 1+
- {m}: The preceding item is matched exactly m times.
- {m,}: The preceding item is matched m or more times, i.e., m+
- {m,n}: The preceding item is matched at least m times, but not more than n times.

For example: The regex xy{2,4} accepts "xyy", "xyyy" and "xyyyy".
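The bounds of {m,n} can be checked with a small Python sketch (Python and the extra sample patterns colou?r, ab* and ab+ are assumptions added for illustration):

```python
import re

pattern = re.compile(r"xy{2,4}")
samples = ["xy", "xyy", "xyyy", "xyyyy", "xyyyyy"]
# Map each sample to whether the whole string is accepted.
results = {s: pattern.fullmatch(s) is not None for s in samples}
# {'xy': False, 'xyy': True, 'xyyy': True, 'xyyyy': True, 'xyyyyy': False}

optional_u = [bool(re.fullmatch(r"colou?r", s)) for s in ["color", "colour"]]  # [True, True]
star_zero = re.fullmatch(r"ab*", "a") is not None   # True: * allows zero b's
plus_zero = re.fullmatch(r"ab+", "a") is not None   # False: + needs at least one b
```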

9 Modifiers

You can apply modifiers to a regex to tailor its behavior, such as global, case-insensitive, multiline, etc. The ways to apply modifiers differ among languages.

In Perl, you can attach modifiers after a regex, in the form of /.../modifiers. For example:

    m/abc/i   # case-insensitive matching
    m/abc/g   # global (match ALL instead of only the first)

In Java, you apply modifiers when compiling the regex Pattern. For example,

    Pattern p1 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE); // for case-insensitive matching
    Pattern p2 = Pattern.compile(regex, Pattern.MULTILINE);        // for multiline input string
    Pattern p3 = Pattern.compile(regex, Pattern.DOTALL);           // dot (.) matches all characters including newline

The commonly-used modifier modes are:

- Case-Insensitive mode (or i): case-insensitive matching for letters.
- Global (or g): match all occurrences instead of only the first.
- Multiline mode (or m): affects ^ and $. In multiline mode, ^ matches start-of-line or start-of-input, and $ matches end-of-line or end-of-input. \A and \Z are not affected: \A still matches only start-of-input and \Z only end-of-input.
- Single-line mode (or s): Dot (.) will match all characters, including newline.
- Comment mode (or x): ignores whitespace in the pattern and allows embedded comments starting with # till end-of-line (EOL).
- more...
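For comparison, here is a hedged sketch of the same modes in Python, where modifiers are passed as flags (or written inline, e.g. (?i)). Python has no g modifier; findall() or finditer() returns all matches, while search() returns the first. The sample strings and the phone-like pattern are made up for illustration:

```python
import re

# Case-insensitive (i)
ci = re.findall(r"abc", "abc ABC Abc", re.IGNORECASE)      # ['abc', 'ABC', 'Abc']
inline_ci = re.fullmatch(r"(?i)abc", "ABC") is not None    # True

# Multiline (m): ^ also matches after each line break.
first_only = re.findall(r"^\w+", "one\ntwo")               # ['one']
each_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)  # ['one', 'two']

# Single-line (s): dot also matches newline.
dot_default = re.search(r"a.b", "a\nb") is not None            # False
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None     # True

# Comment mode (x): whitespace and # comments in the pattern are ignored.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
verbose_ok = phone.fullmatch("555-1234") is not None       # True
```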

10 Greediness, Laziness and Backtracking for Repetition Operators

Greediness of Repetition Operators *, +, ?, {m,n}: The repetition operators are greedy, and by default grab as many characters as possible for a match. For example, the regex xy{2,4} tries to match "xyyyy" first, then "xyyy", and then "xyy".

Lazy Quantifiers *?, +?, ??, {m,n}?, {m,}?: You can put an extra ? after a repetition operator to curb its greediness (i.e., stop at the shortest match). For example,

    input = "The <code>first</code> and <code>second</code> instances"
    regex = <code>.*</code> matches "<code>first</code> and <code>second</code>"
    But regex = <code>.*?</code> produces two matches: "<code>first</code>" and "<code>second</code>"
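The same greedy-versus-lazy comparison can be reproduced in Python (an assumption of this sketch; findall() is used to collect every match):

```python
import re

text = "The <code>first</code> and <code>second</code> instances"

# Greedy: .* runs to the LAST </code>, giving one long match.
greedy = re.findall(r"<code>.*</code>", text)
# ['<code>first</code> and <code>second</code>']

# Lazy: .*? stops at the FIRST </code> each time, giving two matches.
lazy = re.findall(r"<code>.*?</code>", text)
# ['<code>first</code>', '<code>second</code>']
```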

Backtracking: If a regex reaches a state where a match cannot be completed, it backtracks by unwinding one character from the greedy match. For example, if the regex z*zzz is matched against the string "zzzz", the z* first matches "zzzz"; unwinds to match "zzz"; unwinds to match "zz"; and finally unwinds to match "z", so that the rest of the pattern (zzz) can find a match.

Possessive Quantifiers *+, ++, ?+, {m,n}+, {m,}+: You can add an extra + after a repetition operator to disable backtracking, even if this results in a match failure. For example, z++z will not match "zzzz", because z++ consumes all four z's and never gives one back for the final z. This feature might not be supported in some languages.
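The outcome of that backtracking can be observed in Python by wrapping z* in a capturing group (the group is an addition made only to expose what z* ended up matching):

```python
import re

m = re.fullmatch(r"(z*)zzz", "zzzz")

matched = m is not None     # True: the overall match succeeds
star_part = m.group(1)      # 'z': z* gave back three of the four z's
```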

11 Position Anchors ^, $, \b, \B, \<, \>, \A, \Z

Position anchors DO NOT match actual characters; they match positions in a string, such as start-of-line, end-of-line, start-of-word, and end-of-word.

- ^ and $: The ^ matches the start-of-line. The $ matches the end-of-line excluding newline, or end-of-input (for input not ending with newline). These are the most commonly-used position anchors. For example:
  - ing$ # ending with 'ing'
  - ^testing 123$ # matches exactly one string; an equality comparison is simpler
  - ^[0-9]+$ # numeric string
- \b and \B: The \b matches the boundary of a word (i.e., start-of-word or end-of-word), and \B matches the inverse of \b, i.e., a non-word-boundary. For example:
  - \bcat\b # matches the word "cat" in the input "This is a cat." but does not match in "This is a catalog."
- \< and \>: The \< and \> match the start-of-word and end-of-word, respectively (compared with \b, which can match both the start and end of a word). These two anchors are not supported by all engines.
- \A and \Z: The \A matches the start of the input. The \Z matches the end of the input.
- \A and \Z differ from ^ and $ when matching input with multiple lines in multiline mode. There, ^ matches at the start of the string and after each line break, while \A matches only at the start of the string; $ matches at the end of the string and before each line break, while \Z matches only at the end of the string. For example:

    $ python3
    >>> import re
    # Using ^ and $ in multiline mode
    >>> p1 = re.compile(r'^.+$', re.MULTILINE)  # . for any character except newline
    >>> p1.findall('testing\ntesting')
    ['testing', 'testing']
    >>> p1.findall('testing\ntesting\n')
    ['testing', 'testing']
    # ^ matches start-of-input or after each line break at start-of-line
    # $ matches end-of-input or before line break at end-of-line
    # newlines are NOT included in the matches

    # Using \A and \Z in multiline mode
    >>> p2 = re.compile(r'\A.+\Z', re.MULTILINE)
    >>> p2.findall('testing\ntesting')
    []
    # This pattern does not match the internal \n
    >>> p3 = re.compile(r'\A.+\n.+\Z', re.MULTILINE)  # to match the internal \n
    >>> p3.findall('testing\ntesting')
    ['testing\ntesting']
    >>> p3.findall('testing\ntesting\n')
    []
    # This pattern does not match the trailing \n
    # \A matches start-of-input and \Z matches end-of-input
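The word-boundary and numeric-string examples from the list above can be sketched in Python as follows (the "concatenate" input is an added illustration; the trailing-newline behaviour shown is that of Python's re module and may differ in other engines):

```python
import re

# \b: "cat" only as a whole word.
whole_word = re.search(r"\bcat\b", "This is a cat.") is not None        # True
inside_word = re.search(r"\bcat\b", "This is a catalog.") is not None   # False

# \B: "cat" only when it is NOT at a word boundary on either side.
inner_starts = [m.start() for m in re.finditer(r"\Bcat\B", "concatenate cat")]  # [3]

# ^...$ for a numeric string.
numeric = re.search(r"^[0-9]+$", "12345") is not None        # True
not_numeric = re.search(r"^[0-9]+$", "12a45") is not None    # False

# $ also matches just before a trailing newline; \Z (in Python) does not.
dollar_newline = re.search(r"^[0-9]+$", "12345\n") is not None     # True
strict_end = re.search(r"\A[0-9]+\Z", "12345\n") is not None       # False
```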

12 Capturing Matches via Parenthesized Back-References & Matched Variables $1, $2, ...

Parentheses ( ) serve two purposes in regex:

1. Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, ...) is different from abc+ (accepts abc, abcc, abccc, ...).
2. Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched substring. For example, the regex (\S+) creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the input string; the regex (\S+)\s+(\S+) creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in special variables $1, $2, … (or \1, \2, ... in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back-references which match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.

Back-references are important for manipulating strings. For example, the following Perl expression swaps the first and second words separated by a space:

s/(\S+) (\S+)/$2 $1/; # Swap the first and second words separated by a single space
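A rough Python counterpart of the Perl substitution is sketched below, using re.sub() with \1 and \2 in the replacement. Note one difference: re.sub() replaces every match unless count is given, whereas the Perl expression above (without g) replaces only the first. The sample strings are made up for illustration:

```python
import re

text = "one two three four"

# count=1 mirrors Perl's s/// without the g modifier.
swap_first = re.sub(r"(\S+) (\S+)", r"\2 \1", text, count=1)  # 'two one three four'
swap_all = re.sub(r"(\S+) (\S+)", r"\2 \1", text)             # 'two one four three'

# Reading the captured groups directly from a match object.
m = re.match(r"(\S+)\s+(\S+)", "apple   banana cherry")
first_word, second_word = m.group(1), m.group(2)              # 'apple', 'banana'

# A back-reference inside the pattern itself: find a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "this is is a test").group(1)  # 'is'
```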

13 (Advanced) Lookahead/Lookbehind, Groupings and Conditional

These features might not be supported in some languages.

Positive Lookahead (?=pattern)

The (?=pattern) is known as positive lookahead. It checks whether pattern matches at the current position, but does not capture the match, returning only the result: match or no match. It is also called an assertion, as it does not consume any characters in matching. For example, the following complex regex is used by AngularJS to match email addresses:

^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$

The first positive lookahead, (?=.{1,254}$), limits the total length to 254 characters. The second positive lookahead, (?=.{1,64}@), allows at most 64 characters before the '@' sign for the username. Because lookaheads consume no characters, both are checked at the start of the input, right after ^.
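To see the two length checks at work, the email pattern quoted above can be exercised in Python. This is only a sketch: the pattern is copied verbatim and split across several raw-string literals for readability, and the test addresses are invented for illustration.

```python
import re

EMAIL = re.compile(
    r"^(?=.{1,254}$)(?=.{1,64}@)"
    r"[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*"
    r"@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    r"(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$"
)

ok = EMAIL.match("user@example.com") is not None            # True

# 65 characters before '@': rejected by the second lookahead.
long_user = EMAIL.match("a" * 65 + "@example.com") is not None   # False

# 64-character username plus a long (but well-formed) domain:
# 260 characters in total, rejected by the first lookahead.
long_total_address = "a" * 64 + "@" + ("b" * 63 + ".") * 3 + "com"
long_total = EMAIL.match(long_total_address) is not None    # False
```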

Negative Lookahead (?!pattern)

Inverse of (?=pattern): it matches if pattern is missing. For example, a(?=b) matches 'a' in 'abc' (without consuming 'b'), but not in 'acc'; whereas a(?!b) matches 'a' in 'acc', but not in 'abc'.
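The a(?=b) and a(?!b) examples can be verified with a short Python sketch (Python is an assumption here; the syntax is the same in most engines that support lookahead):

```python
import re

pos = re.search(r"a(?=b)", "abc")
pos_text = pos.group()     # 'a': the 'b' is checked but not part of the match
pos_end = pos.end()        # 1: matching would resume at 'b'
pos_miss = re.search(r"a(?=b)", "acc") is None   # True

neg_text = re.search(r"a(?!b)", "acc").group()   # 'a'
neg_miss = re.search(r"a(?!b)", "abc") is None   # True
```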

Positive Lookbehind (?<=pattern)

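The (?<=pattern) is known as positive lookbehind. It mirrors positive lookahead: it asserts that pattern matches immediately before the current position, without consuming any characters. For example, (?<=a)b matches 'b' in 'abc', but not in 'cbc'.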

Negative Lookbehind (?<!pattern)

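Inverse of (?<=pattern): it matches if pattern is missing immediately before the current position. For example, (?<!a)b matches 'b' in 'cbc', but not in 'abc'.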
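Lookbehind works like lookahead, but inspects the text before the current position. A minimal Python sketch follows; the price-like sample text is invented for illustration:

```python
import re

text = "cost: $42, qty: 7"

# Positive lookbehind: digits that are immediately preceded by '$'.
after_dollar = re.findall(r"(?<=\$)\d+", text)        # ['42']

# Negative lookbehind: whole numbers NOT immediately preceded by '$'.
# The \b keeps the regex from starting in the middle of "42".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)  # ['7']
```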

Non-Capturing Group (?:pattern)

Recall that you can use parenthesized back-references to capture the matches. To disable capturing, use ?: inside the parentheses, in the form of (?:pattern). The parentheses then only group the sub-expression, and no back-reference is created.

Example: (?:abc)+ applies + to the whole group abc without creating a back-reference.
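The difference between capturing and non-capturing groups shows up clearly in Python, where findall() returns the captured group when one exists and the whole match otherwise (the sample string is made up for illustration):

```python
import re

# With a capturing group, findall() returns what the group captured last.
capturing = re.findall(r"(ab)+c", "ababc")         # ['ab']

# With a non-capturing group, findall() returns the whole match.
non_capturing = re.findall(r"(?:ab)+c", "ababc")   # ['ababc']

# Only the real capturing group is numbered.
groups = re.match(r"(?:ab)+(c)", "ababc").groups() # ('c',)
```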

Named Capturing Group (?<name>pattern)

The capture group can be referenced later by name.
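As a hedged illustration in Python: note that Python's re module spells named groups as (?P<name>pattern) rather than (?<name>pattern), and refers to them in replacements as \g<name>. The group names and sample strings below are invented for the example:

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
year = m.group("year")    # '2024'
parts = m.groupdict()     # {'year': '2024', 'month': '05'}

# Named groups can also be referenced in the replacement string.
swapped = re.sub(r"(?P<first>\S+) (?P<second>\S+)",
                 r"\g<second> \g<first>", "hello world")  # 'world hello'
```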

Atomic Grouping (?>pattern)

Disable backtracking, even if this may lead to match failure.

Conditional (?(Cond)then|else)

[TODO]

14 Unicode

The metacharacters \w, \W (word and non-word character) and \b, \B (word and non-word boundary) recognize Unicode characters. In some engines this behavior must be enabled with a flag.
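In Python 3, for instance, \w is Unicode-aware by default for str patterns, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. A small sketch, with accented sample words written as escapes to avoid encoding issues:

```python
import re

text = "na\u00efve caf\u00e9"   # "naïve café"

unicode_words = re.findall(r"\w+", text)
# ['naïve', 'café']: the accented letters count as word characters

ascii_words = re.findall(r"\w+", text, re.ASCII)
# ['na', 've', 'caf']: the accented letters now break the words
```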

Source: www.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html