Regex
Author: Mahesh Yadav Gaddam
Editor: Pavan Devarakonda
Regular Expressions
Regular expressions are an essential skill for any middleware administrator. The syntax is simple, and learning to craft expressions well makes you a more effective admin. Most SOA-based environments pass multi-megabyte files between services. As a web service admin you may work with File Adapters and very large files, and you may need to validate the data fields in those files. This article gives an overview of regex syntax with examples. In UNIX shell programming, regular expressions are used to search for text in files; remember that in UNIX everything can be treated as a file. Regular expressions are typically used with the grep command and the stream editor sed, often in combination with xargs.
xargs
UNIX limits the total size of the arguments and environment that can be passed down to a child process
What happens when we have a list of 10,000 files to send to a command?
o xargs handles this problem
o Reads arguments as standard input
o Sends them to commands that take file lists
o May invoke program several times depending on size of arguments
The find utility command and xargs combination
find . -type f -print | xargs wc -l
-type f for files
-print to print them out
xargs invokes wc 1 or more times
wc -l a b c d e f g
wc -l h i j k l m n o
…
Compare to: find . -type f -exec wc -l {} \;
which runs wc once for every file found
The -n option can be used to limit the number of arguments passed to each invocation
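To make the batching done by -n visible, here is a minimal sketch. The names a through g are placeholder arguments rather than real files, and echo stands in for wc -l so the grouping can be seen without touching the filesystem.

```shell
# Placeholder arguments a..g stand in for file names (an assumption for illustration).
# xargs -n 3 passes at most three arguments to each invocation of echo.
batch_args() {
    printf '%s\n' a b c d e f g | xargs -n 3 echo
}

batch_args
```

Each output line corresponds to one invocation of the command. With real files, the same idea would look like find . -type f -print | xargs -n 3 wc -l.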
What is a Regular Expression?
A regular expression (regex) describes a set of possible input strings.
Regular expressions descend from a fundamental concept in Computer Science called finite automata theory
Regular expressions are endemic to UNIX
vi, ed, sed, and emacs
awk, tcl, perl and Python
grep, egrep, fgrep
compilers
The simplest regular expressions are a string of literal characters to match.
The string matches the regular expression if it contains the substring.
A regular expression can match a string in more than one place.
The . regular expression can be used to match any single character.
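The points above can be tried out with grep. This is a small sketch on made-up sample lines; the function names are only for illustration, and -o is a GNU/BSD grep extension rather than a POSIX option.

```shell
# Sample input lines, made up for illustration.
sample() {
    printf '%s\n' 'the korn shell' 'popcorn' 'kern' 'banana'
}

# Literal regex: selects lines that contain the substring "korn".
match_literal() { sample | grep 'korn'; }

# The . matches any single character, so k.rn matches both korn and kern.
match_dot() { sample | grep 'k.rn'; }

# -o prints each match on its own line, which shows that "an"
# matches the string banana in more than one place.
match_places() { printf '%s\n' 'banana' | grep -o 'an'; }

match_literal
match_dot
match_places
```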
Character Classes
Character classes [] can be used to match any one character from a specific set.
[aeiou] will match any of the characters a, e, i, o, or u
[kK]orn will match korn or Korn
Ranges can also be specified in character classes
[1-9] is the same as [123456789]
[abcde] is equivalent to [a-e]
You can also combine multiple ranges
[abcde123456789] is equivalent to [a-e1-9]
Note that the - character has a special meaning in a character class only when it is used within a range; placed first, it is a literal character.
[-123] would match the characters -, 1, 2, or 3
Negated Character Classes
Character classes can be negated with the [^] syntax.
[^aeiou] will match any single character other than a, e, i, o, or u
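A short sketch of these character classes with grep follows. The input lines and function names are invented for illustration, and LC_ALL=C is set on the assumption that plain ASCII ordering is wanted for ranges such as a-e.

```shell
# Ranges depend on the locale's collation order; C keeps them plain ASCII.
export LC_ALL=C

# [kK]orn matches korn or Korn but not corn.
match_korn() { printf '%s\n' 'korn' 'Korn' 'corn' | grep '[kK]orn'; }

# [a-e1-9] combines two ranges; "f0" has no character from either range.
match_ranges() { printf '%s\n' 'b' '7' 'f0' | grep '[a-e1-9]'; }

# A leading - is literal: [-123] matches -, 1, 2, or 3.
match_hyphen() { printf '%s\n' '-' '2' '4' | grep '[-123]'; }

# [^0-9] matches any character that is not a digit, so only "12a" is selected.
match_negated() { printf '%s\n' '123' '12a' | grep '[^0-9]'; }

match_korn
match_ranges
match_hyphen
match_negated
```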
Named Character Classes
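Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl)
Syntax [:name:], used inside a bracket expression
[a-zA-Z] can be written as [[:alpha:]]
[a-zA-Z0-9] can be written as [[:alnum:]]
[45a-z] can be written as [45[:lower:]]
Named classes are important for portability across languages and locales, where the alphabet is not always a-z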
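As a quick illustration of named classes, the sketch below runs grep on a few made-up lines. LC_ALL=C is set here only to keep the results predictable; in another locale the same classes would also cover that locale's letters.

```shell
# Fix the locale so the named classes cover ASCII characters only.
export LC_ALL=C

mixed() { printf '%s\n' 'abc' '123' '!!!'; }

# [[:alpha:]] selects lines containing at least one letter.
match_alpha() { mixed | grep '[[:alpha:]]'; }

# [[:alnum:]] selects lines containing a letter or a digit.
match_alnum() { mixed | grep '[[:alnum:]]'; }

# A named class can be mixed with literal characters: 4, 5, or any lowercase letter.
match_mixed() { printf '%s\n' '4' 'x' '7' 'X' | grep '[45[:lower:]]'; }

match_alpha
match_alnum
match_mixed
```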
Anchors
Anchors are used to match at the beginning or end of a line (or both).
^ means beginning of the line
$ means end of the line
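The effect of the two anchors can be sketched with grep on three invented lines; the function names are illustrative only.

```shell
lines() { printf '%s\n' 'korn shell' 'the korn' 'korn'; }

# ^korn matches only where korn starts the line.
match_start() { lines | grep '^korn'; }

# korn$ matches only where korn ends the line.
match_end() { lines | grep 'korn$'; }

# ^korn$ uses both anchors, so the whole line must be exactly korn.
match_whole() { lines | grep '^korn$'; }

match_start
match_end
match_whole
```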
Repetition
The * is used to define zero or more occurrences of the single regular expression preceding it.
Match length
A match will be the longest string that satisfies the regular expression.
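Both ideas, zero or more repetitions and the longest match, can be seen in a small grep sketch. The sample strings are made up, and -o (print only the matched text) is a GNU/BSD grep extension rather than a POSIX option.

```shell
# ab*c: an a, then zero or more b characters, then a c.
# "adc" is not selected because d is not b.
match_star() { printf '%s\n' 'ac' 'abc' 'abbc' 'adc' | grep 'ab*c'; }

# ab* could stop after a, ab, or abb, but the reported match
# is the longest one available: abbb.
longest_match() { printf '%s\n' 'xabbbc' | grep -o 'ab*'; }

match_star
longest_match
```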
Repetition Ranges
Regular expressions can also specify how many times an item repeats.
The { } notation specifies a range of repetitions for the immediately preceding regex
{n} means exactly n occurrences
{n,} means at least n occurrences
{n,m} means at least n occurrences but no more than m occurrences
Example:
.{0,} same as .*
a{2,} same as aaa*
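The repetition ranges can be checked with grep -E, which enables the extended syntax where { } is written without backslashes. This is a sketch on invented input; the anchors ^ and $ are added so that the count applies to the whole line.

```shell
runs() { printf '%s\n' 'a' 'aa' 'aaa' 'aaaa'; }

# {2}: exactly two occurrences.
exactly_two() { runs | grep -E '^a{2}$'; }

# {2,}: at least two occurrences.
at_least_two() { runs | grep -E '^a{2,}$'; }

# {2,3}: at least two but no more than three occurrences.
two_to_three() { runs | grep -E '^a{2,3}$'; }

# aaa* is the same set written with *: two a characters, then zero or more.
at_least_two_star() { runs | grep -E '^aaa*$'; }

exactly_two
at_least_two
two_to_three
at_least_two_star
```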
Sub-expressions
If you want to group part of an expression so that * or { } applies to more than just the previous character, use ( ) notation
Subexpressions are treated like a single character
a* matches 0 or more occurrences of a
abc* matches ab, abc, abcc, abccc, …
(abc)* matches the empty string, abc, abcabc, abcabcabc, …
(abc){2,3} matches abcabc or abcabcabc
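The difference between abc* and (abc)* can be sketched with grep -E, which accepts ( ) and { } without backslashes. The input lines are invented for illustration, and the anchors force the pattern to cover the whole line.

```shell
words() { printf '%s\n' 'ab' 'abc' 'abcc' 'abcabc'; }

# Without grouping, * applies only to the c.
match_char_star() { words | grep -E '^abc*$'; }

# With grouping, * applies to the whole sequence abc.
match_group_star() { words | grep -E '^(abc)*$'; }

# (abc){2,3}: the group repeated two or three times.
match_group_range() {
    printf '%s\n' 'abc' 'abcabc' 'abcabcabc' 'abcabcabcabc' |
        grep -E '^(abc){2,3}$'
}

match_char_star
match_group_star
match_group_range
```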