regex

- Character classes: . [abc] [a-z] \d \w \s
  - . means "any character"
  - \d means "a digit"
  - \w means "a word character", [0-9A-Za-z_]
  - \s means "a space, tab, carriage return or line feed character"
  - Negated character classes: [^abc] \D \W \S
- Multipliers: {4} {3,16} {1,} ? * +
  - ? means "zero or one"
  - * means "zero or more"
  - + means "one or more"
  - Multipliers are greedy unless you put a ? afterwards
- Alternation and grouping: (Septem|Octo|Novem|Decem)ber
- Word, line and text boundaries: \b ^ $ \A \z
- To refer back to capture groups: \1 \2 \3 etc. (works in both replacement expressions and matching expressions)
- List of metacharacters: . \ [ ] { } ? * + | ( ) ^ $
- List of metacharacters when inside a character class: [ ] \ - ^
- You can always escape a metacharacter using a backslash: \
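
As a quick check of a few items from the list above, here is a minimal Python sketch using the standard re module. The sample strings and variable names are made up for illustration, and the (?:...) non-capturing group is not covered in the list; it is used here only so that re.findall returns whole matches rather than group contents.

```python
import re

# Character class: \d matches a single digit.
digits = re.findall(r'\d', 'a1b22')
print(digits)    # ['1', '2', '2']

# Greedy vs. non-greedy multipliers on the same input.
html = '<b>bold</b>'
greedy = re.findall(r'<.+>', html)
lazy = re.findall(r'<.+?>', html)
print(greedy)    # ['<b>bold</b>']
print(lazy)      # ['<b>', '</b>']

# Alternation and grouping.
months = re.findall(r'(?:Septem|Octo|Novem|Decem)ber',
                    'September, October and July')
print(months)    # ['September', 'October']

# Backreference inside the matching expression: find a doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'it is is fine').group(0)
print(doubled)   # is is

# Backreferences inside the replacement expression: swap two words.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
print(swapped)   # world hello
```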

Matching URLs

Let's assume that we have defined a string:

s = 'I love to visit https://example.com/foo.html every day! More than http://abc-def.co.il/.'

Can you write a regular expression that will match both of these URLs, but not the characters before or after them? We want to include the "/foo.html" in the first URL, but not the trailing period (.) in the second.

Solution

We often think of URLs as fairly simple. However, matching them can be a bit tricky, because of several variations in the URLs we see here. For example, the first begins with "https://", and the second begins with "http://". The first ends with a filename (including a ".html" suffix), while the second has a hostname containing a - character.

Starting from the beginning, we can match the URLs with "https?://". The ? metacharacter indicates that the character preceding it ("s") is optional, and can appear zero or one times. While URLs can start with any number of different protocol names, this particular exercise only required that we match "http" and "https" at the start.
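
To make the effect of the ? concrete, the following small Python sketch tests the prefix pattern on its own against a few candidate strings. The candidates and the names scheme and results are illustrative choices, not part of the exercise.

```python
import re

# The "s" is optional: zero or one occurrence is allowed, but not two.
scheme = re.compile(r'https?://')

candidates = ['http://', 'https://', 'httpss://', 'ftp://']
results = {c: bool(scheme.fullmatch(c)) for c in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
# http:// True
# https:// True
# httpss:// False
# ftp:// False
```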

We then need to match the hostname. We don't want to match every possible character, since not all characters are valid in hostnames. I'm going to assume, for these purposes, that hostnames might contain letters, numbers, underscores, and dashes. And, of course, they might contain periods as well, separating the host from the domain. We also need to take into account the / characters that will appear in the rest of the URL, as in "/foo.html". (The solution I'm presenting here would also match illegal URLs, such as those containing two consecutive . characters.) We can shorten this character class definition by using the built-in \w character class, which for ASCII text is the same as [A-Za-z0-9_].

Note that if we want to create a character class that'll match \w, ., /, and -, then the - character will need to be at the start or end of the character class. Otherwise, it'll be interpreted as defining a range. Also note that . inside of a character class is treated literally, not as a metacharacter. We'll match any number of these characters, indicated by using a + sign following our character class.
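
The two rules in the paragraph above (where the - goes, and how . behaves inside a character class) can be checked with a short Python sketch. The one-character test strings and the variable names are assumptions made for this illustration only.

```python
import re

# A hyphen between two characters defines a range: [a-c] is a, b, or c.
range_matches_b = bool(re.fullmatch(r'[a-c]', 'b'))          # True
range_matches_hyphen = bool(re.fullmatch(r'[a-c]', '-'))     # False

# A hyphen at the end of the class is a literal hyphen: [ac-] is a, c, or -.
literal_matches_b = bool(re.fullmatch(r'[ac-]', 'b'))        # False
literal_matches_hyphen = bool(re.fullmatch(r'[ac-]', '-'))   # True

# Inside a class, . is a literal period; outside, it matches any character.
dot_in_class = bool(re.fullmatch(r'[.]', 'x'))               # False
dot_outside = bool(re.fullmatch(r'.', 'x'))                  # True

# The class from the text, followed by +, consumes a run of allowed characters
# and stops at the first character outside the class (here, a space).
host_and_path = re.match(r'[\w./-]+', 'abc-def.co.il/ and more').group(0)
print(host_and_path)  # abc-def.co.il/
```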

Our URL then ends with a repeat of our character class, but without any . inside (since our URL cannot end with it). This ensures that we won't match trailing punctuation marks.

Given all of this, our regular expression could be:

https?://[\w./-]+[\w/-]

Python

Because we have more than one URL to find in the text, we'll use Python's re.findall method, which returns a list of all found matches:

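import re

s = 'I love to visit https://example.com/foo.html every day! More than http://abc-def.co.il/.'

# The pattern comes first, then the string to search.
# A raw string (r'...') keeps the backslash in \w intact.
re.findall(r'https?://[\w./-]+[\w/-]', s)
# ['https://example.com/foo.html', 'http://abc-def.co.il/']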
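
To see what the final character class contributes, it may help to compare the pattern with and without it. The sketch below is only an illustration: the variable names and the malformed sample URL on the last lines are assumptions, not part of the exercise.

```python
import re

s = ('I love to visit https://example.com/foo.html every day! '
     'More than http://abc-def.co.il/.')

# Without the final class, the trailing period is swallowed by the + run.
without_final_class = re.findall(r'https?://[\w./-]+', s)
print(without_final_class)
# ['https://example.com/foo.html', 'http://abc-def.co.il/.']

# With it, the last matched character can never be a period.
with_final_class = re.findall(r'https?://[\w./-]+[\w/-]', s)
print(with_final_class)
# ['https://example.com/foo.html', 'http://abc-def.co.il/']

# The pattern is permissive: a hostname with consecutive periods still matches.
malformed = re.findall(r'https?://[\w./-]+[\w/-]', 'see http://a..b now')
print(malformed)
# ['http://a..b']
```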

Ruby

In Ruby, if we want to find more than one match to our regexp, we invoke String#scan.

r = Regexp.new('https?://[\w./-]+[\w/-]')
s.scan(r)

This produces an array of matching strings. It's true that we can create a Ruby Regexp object using the Perl-style slash syntax, but I prefer to use Regexp.new.

JavaScript

https://www.stanleycyang.com/tutorials/understanding-the-fundamentals-of-regular-expressions

In JavaScript, we can define our RegExp object, r, such that it applies globally. We do this by passing the 'g' flag when we create the regexp; a Unix-style regexp (with slashes and a trailing "g") is the printed representation. We can then invoke r.exec(s) multiple times; each invocation will return the next match, or `null` if we have reached the end of the matches:

js> s = 'I love to visit https://example.com/foo.html every day! More than http://abc-def.co.il/.'
"I love to visit https://example.com/foo.html every day! More than http://abc-def.co.il/."
js> r = RegExp('https?://[\\w./-]+[\\w/-]', 'g')
/https?:\/\/[\w.\/-]+[\w\/-]/g
js> r.exec(s)
["https://example.com/foo.html"]
js> r.exec(s)
["http://abc-def.co.il/"]
js> r.exec(s)
null
