
Regex Tutorial


Introduction

Perl-compatible regular expressions, or regex, are a very powerful form of pattern matching available in both zMUD and MUSHclient. In both clients, you can select a checkbox to turn a normal trigger into a regex trigger, and in both, regex is much more powerful than the client's native pattern matching. This tutorial explains the basics of regex. Enjoy!


Tutorial

Characters versus Metacharacters

A character is any single letter, number, or symbol. Most characters are taken literally. For example, the pattern Trevize greets you matches exactly that. But that's no better than even the basic pattern matching of the simplest clients. The true advantage in using regular expressions comes from the use of metacharacters, which aren't taken literally.

Period

The most basic 'wildcard' is the period metacharacter. When not escaped (see the next section), it matches any single character. m.n will match man and men, but not mean.
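As a quick check of the m.n example, here is a minimal sketch using Python's re module. Python is only a stand-in here (the clients use their own regex engine), and the helper name matches_whole is made up for illustration.

```python
import re

# An unescaped period matches exactly one character (any character except a newline).
pattern = re.compile(r"m.n")

def matches_whole(text):
    """Return True if the entire text matches m.n (hypothetical helper)."""
    return pattern.fullmatch(text) is not None

# Test each word against the pattern.
results = {word: matches_whole(word) for word in ["man", "men", "mean", "mn"]}
print(results)
```

Only man and men come back True: mean has two characters between m and n, and mn has none.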

Backslash

The most common metacharacter is the backslash, but it has no meaning by itself. It serves many purposes, two of which I will discuss here. First, it is an escape character. If placed before any symbol, that symbol will be taken literally. If you wanted to match a period, you would write it as \. in the pattern. If you wanted to match a plus sign, you would write it as \+ in the pattern. This goes for the backslash itself too: to match a backslash, you would write it as \\ in the pattern. It is always safe to do this before any symbol (that is, any character that is not a letter or a number), even if it is not a metacharacter. If in doubt, precede it with a backslash. So to add a period to the pattern in the first section, we would write it as: Trevize greets you\.
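To see the difference escaping makes, here is a small sketch in Python's re module (a stand-in for the client's regex engine). The variable names are illustrative only. re.escape is a Python convenience that adds the backslashes for you; the clients may not offer an equivalent.

```python
import re

# Without the backslash, the final period is a wildcard.
unescaped = re.compile(r"Trevize greets you.")
# With the backslash, the final period must be a literal period.
escaped = re.compile(r"Trevize greets you\.")

wildcard_hits_bang = unescaped.search("Trevize greets you!") is not None
literal_hits_bang = escaped.search("Trevize greets you!") is not None
literal_hits_period = escaped.search("Trevize greets you.") is not None

# re.escape backslash-escapes the metacharacters in a literal string.
auto_escaped = re.escape("1+1=2.")
auto_matches_literal = re.fullmatch(auto_escaped, "1+1=2.") is not None

print(wildcard_hits_bang, literal_hits_bang, literal_hits_period, auto_matches_literal)
```

The unescaped pattern accepts the line ending in an exclamation mark, while the escaped one only accepts a real period.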

Second, it can be used for basic 'wildcards', or generic character types. A list of these follows.

\d any digit
\s any whitespace
\w any letter, digit, or underscore
\D anything not a digit
\S anything not whitespace
\W anything not a letter, digit, or underscore

These match one, and only one, character of the sort described. For example, \d by itself will only match one digit.
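The character types can be tried out with a short Python sketch. It assumes Python's re module behaves like the client's engine for these escapes, and it passes re.ASCII so that \d, \w, and \s stay limited to plain ASCII as described above. The sample text is made up.

```python
import re

text = "HP: 42 gold_coins!"

# Each escape matches one character at a time, so findall returns single characters.
digits = re.findall(r"\d", text, re.ASCII)      # every digit
non_word = re.findall(r"\W", text, re.ASCII)    # everything that is not a letter, digit, or underscore
spaces = re.findall(r"\s", text, re.ASCII)      # every whitespace character

print(digits)
print(non_word)
print(spaces)
```

Note that 42 comes back as two separate matches, because \d on its own matches a single digit.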

Basic Assertions

Assertions differ from normal parts of a pattern in one important way: they don't consume characters. What this means is that they specify a condition that must hold at a position instead of actually matching any text. A list of basic assertions follows.

^ beginning of a line
$ end of a line
\b a word boundary
\B anything not a word boundary

The most common assertions are the circumflex and dollar. They are used to specify the beginning and end of a line. Starting a trigger with a circumflex and ending it with a dollar is often called 'anchoring' the trigger. This will prevent it from firing in the middle of something, like hearing it over a tell or a channel. If we wanted to anchor the trigger we created in section three, it would look like this:

 ^Trevize greets you\.$

That trigger will only fire if there is nothing before Trevize and nothing after the period. A word boundary is a position where the character on one side matches \w and the character on the other side does not (the beginning and end of a line count as not matching \w). So why not use \W instead? Because \b is an assertion, which means it doesn't actually match anything, but just requires that the boundary be there. \W, on the other hand, has to match an actual character, so it wouldn't match at the end or beginning of a line. The pattern Trevize\b would match Trevize but not Trevizes or Trevizer.
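Here is a rough sketch of anchoring and word boundaries using Python's re module as a stand-in for the client. The spoofed line is an invented example of the trigger text arriving inside a tell.

```python
import re

anchored = re.compile(r"^Trevize greets you\.$")
unanchored = re.compile(r"Trevize greets you\.")

direct = "Trevize greets you."
spoofed = "Bob tells you 'Trevize greets you.'"  # invented example line

anchored_direct = anchored.search(direct) is not None
anchored_spoofed = anchored.search(spoofed) is not None      # the anchors block this
unanchored_spoofed = unanchored.search(spoofed) is not None  # fires in the middle of the line

# \b requires a boundary but consumes nothing, so it works at the end of a line.
word = re.compile(r"Trevize\b")
boundary_results = {s: word.search(s) is not None
                    for s in ["Trevize", "Trevize.", "Trevizes", "Trevizer"]}

# \W must consume a real character, so it fails when the line ends right after the name.
non_word_at_end = re.search(r"Trevize\W", "Trevize") is not None

print(anchored_direct, anchored_spoofed, unanchored_spoofed)
print(boundary_results, non_word_at_end)
```

The anchored pattern ignores the spoofed tell, and Trevize\b accepts the bare name at the end of a line where Trevize\W does not.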

Character Classes

A character class matches any single character defined by the class. It begins with an opening square bracket and ends with a closing square bracket. Everything between those brackets is a potential character the class can match. In the class, you can include normal characters and escaped metacharacters. If a circumflex is the first character in the class, it means the class will match anything except what is in the class. A dash serves to specify a range. Here are a few examples:

[abcde] matches a, b, c, d, or e
[^abcde] matches anything but a, b, c, d, or e
[a-e] same as [abcde]
[0-9] same as \d
[a-zA-Z0-9_] same as \w

Their power lies in being able to be very specific or as broad as you want. For example, we used m.n earlier to match man and men. But it would also match min or mmn, which we do not want. Yet \w is no better. Instead, we could use a character class here to specify just a or e. The pattern m[ae]n would do this perfectly.
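The comparison between the period, \w, and a character class can be sketched in Python as follows. The re module is assumed to behave like the client's engine for these constructs, and the word list is just test data.

```python
import re

words = ["man", "men", "min", "mmn", "m_n"]

# The period and \w are both too broad for this job.
dot = [w for w in words if re.fullmatch(r"m.n", w)]
word_char = [w for w in words if re.fullmatch(r"m\wn", w)]

# The class allows only a or e in the middle.
char_class = [w for w in words if re.fullmatch(r"m[ae]n", w)]

# A leading circumflex inverts the class.
negated = [w for w in words if re.fullmatch(r"m[^ae]n", w)]

print(dot)
print(word_char)
print(char_class)
print(negated)
```

Both m.n and m\wn accept all five words, while m[ae]n keeps only man and men.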

Subpatterns and the Vertical Bar

Subpatterns are just that, a small pattern within a larger one. A subpattern is anything surrounded by parentheses. They have many uses. Most importantly, plain parentheses are 'capturing' subpatterns. In zMUD and MUSHclient, that means they pass whatever they matched back to the script as %1 through %9. For example, if you wanted to capture a dice roll, you could use \d inside a subpattern. The pattern:

^You roll the dice, and get (\d) and (\d)\.$

Would send the two numbers to the script as %1 and %2. If you wanted to give several options for a subpattern, you can split it in two with a vertical bar. The pattern:

^You are (tall|short)\.$

Would match with either tall or short. It would also send whichever matched to %1, which we may not want! If you make a question mark then a colon the first two things in a subpattern, it will group but it will not capture. So we could fix that pattern like this:

^You are (?:tall|short)\.$

And ta-da! Perfection. You can have as many vertical bars as you want, and you can have subpatterns inside subpatterns. You can also make a vertical bar the last thing in the subpattern, like this:

^You are (?:very |not |)short\.$

And it would match very short, not short, and just plain short.
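The subpattern examples above can be tried in Python, which is used here only as a stand-in for the client. In this sketch, Python's groups() plays the role of %1, %2, and so on, and the sample lines are invented.

```python
import re

# Capturing subpatterns: each (\d) becomes one captured value.
roll = re.search(r"^You roll the dice, and get (\d) and (\d)\.$",
                 "You roll the dice, and get 3 and 5.")
dice = roll.groups()

# Plain parentheses capture; (?: ... ) groups without capturing.
capturing = re.search(r"^You are (tall|short)\.$", "You are tall.").groups()
non_capturing = re.search(r"^You are (?:tall|short)\.$", "You are tall.").groups()

# A trailing vertical bar adds an empty alternative, making the words optional.
optional = re.compile(r"^You are (?:very |not |)short\.$")
lines = ["You are very short.", "You are not short.",
         "You are short.", "You are quite short."]
optional_results = [optional.search(s) is not None for s in lines]

print(dice, capturing, non_capturing, optional_results)
```

The non-capturing version still matches, but hands nothing back to the script.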

Quantifiers

Finally, one of the most useful aspects of regex is repetition. Repetition is specified by quantifiers. A quantifier may follow any single character, a period, an escaped character type, a character class, or a subpattern, but not an assertion. A quantifier is normally surrounded by curly braces and specifies either an exact count or a minimum and a maximum. There are also three shorthand quantifiers. Examples:

{3} matches exactly three times
{1,2} matches between one and two times
{3,} matches at least three times, with no maximum
+ same as {1,} (one or more times)
* same as {0,} (zero or more times)
? same as {0,1} (zero or one time)

What does this mean? Well, z{10} would match zzzzzzzzzz, and z{2,4} would match zz, zzz, and zzzz. The pattern \d+ would match any number of digits, as long as there is at least one. You could use [abc]? to match a, b, c, or nothing.
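These quantifier examples can be verified with a short Python sketch. It assumes Python's re module treats these quantifiers the same way as the client's engine, and the helper name full is made up for illustration.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

exact = full(r"z{10}", "z" * 10)
too_few = full(r"z{10}", "z" * 9)

# Which run lengths of z does z{2,4} accept?
ranged = [n for n in range(1, 7) if full(r"z{2,4}", "z" * n)]

# \d+ needs at least one digit; [abc]? accepts one listed letter or nothing.
one_or_more = [s for s in ["", "7", "2024"] if full(r"\d+", s)]
optional_class = [s for s in ["", "a", "b", "c", "d", "ab"] if full(r"[abc]?", s)]

print(exact, too_few, ranged, one_or_more, optional_class)
```

The empty string passes [abc]? but fails \d+, which is the practical difference between ? and +.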

One common usage is an? to match a or an. In the last example of the subpatterns section above, you could instead use a quantifier to make the whole subpattern optional, move the space outside the subpattern, and make that space optional as well. Like so:

^You are (?:very|not)? ?short\.$
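One thing worth checking is that this version is a little looser than the earlier one, because the word and the space are optional independently. The sketch below uses Python's re module as a stand-in for the client. The tighter variant shown is a suggestion of mine, not part of the original pattern, and the test lines are invented.

```python
import re

loose = re.compile(r"^You are (?:very|not)? ?short\.$")
# Suggested alternative: keep the space inside the optional subpattern.
tight = re.compile(r"^You are (?:very |not )?short\.$")

lines = ["You are very short.", "You are not short.", "You are short.",
         "You are veryshort.", "You are  short."]  # last one has two spaces

loose_results = [loose.search(s) is not None for s in lines]
tight_results = [tight.search(s) is not None for s in lines]

# an? matches a or an, but nothing longer.
article_results = [re.fullmatch(r"an?", s) is not None for s in ["a", "an", "ann"]]

print(loose_results)
print(tight_results)
print(article_results)
```

Both versions accept the three intended lines, but only the loose one also accepts veryshort and the doubled space. Whether that matters depends on what the MUD can actually send.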


References

http://www.gammon.com.au/mushclient/regexp.htm
http://mushclient.com/pcre/pcrepattern.html
