VB/VBA regular expression

1. Definition

Regular expressions (RE) are used for pattern matching purpose in programming. Most important cases that RE are used include string searching and replacement. The RE facility exists in many programming languages, including VB, VBA, VBScript, Javascript, Delphi, C#, VB.NET, Java, Perl, PHP etc.

2. Quick view on basic principles

- : means a range. E.g., a-z means all the letters a to z. [] : means exactly one of the letters quoted. E.g., [abc] means exactly one of a, b or c. [A-Z] means exactly one of A, B, ..., Z () : just for grouping purpose. | : means "or". E.g. a|b means a or b. + : means one or more of the pattern before it. E.g., a+ means one or more consequtive a. (ab)+ means ab, abab, ababab, .. * : means zero or more. E.g., [a-z]* means empty string, or any string consisting of a, .., z. ? : means zero or one. E.g., [a-z]? means empty string, or any one of a, b, .., z.

3. Principles by classification

3.1 Single-character patterns

. : a single dot. Matches any character except newline (\n). E.g., "x." matches a string of length 2, first letter is 'x', second is any letter except '\n'. [] : character class, a list of characters enclosed by '[' and ']'. Exactly one of these characters is used. E.g., [abcdef], [a-z], [A-Z], [0123456789], [0-9], [a-zA-Z], [a-zA-Z0-9_]. ^ : used at the beginning of a character class, means the negated character class. E.g., [^0-9], [^abcdABCD], [^a-z], [^\^]. Some predefined character class abbreviations: abr: same as | meaning ------------------------------ \d : [0-9] : a digit \D : [^0-9] : non-digits \w : [a-zA-Z0-9_] : word char \W : [^a-zA-Z0-9_] : non-word char \s : [ \r\t\n\f] : space char \S : [^ \r\t\n\f] : non-space char \n : [\n] : new line

3.2 Grouping patterns

- Sequence: a continuous series of patterns. E.g., abc, 123456 - Multiplier, repeated copies of the immediately previous pattern 1) {min, max} : general multiplier: min to max copies {n} : exactly n copies. 2) * : zero or more. same as {0,}. 3) + : one or more. same as {1,}. 4) ? : zero or one. same as {0,1}. E.g., [a]*, a+, a?, [a]{0,}, a{1,}, a{0,1}. - Memory: remember the pattern quoted by '(' and ')', the remembered patterns are refered to as \1, \2, \3, ... in the pattern string, and also can be refered to by $1, $2, $3, ... later. E.g., fred(.)barney\1, this matches "fredxbarneyx". and now the value of $1 is "x". - Alternation: match exactly one of the alternatives. 1) | E.g., red|green|blue

3.3 Anchoring patterns

1) ^ : marks the beginning of a pattern. E.g., ^a, the 'a' must appear at the beginning of the matched string. 2) $ : marks the end of a pattern. E.g., a$, the 'a' must appear at the end of the matched string. 3) \b : marks a word boundary (a spot between 2 characters that has a \w on one side of it, and a \W on the other side). 4) \B : marks a non-boundary of word. 3) more ...

3.4 Other issues

- Escape sequence: To escape special charaters, e.g., those used above. E.g., \., \[, \], \(, \), \\, \*, \+, \?, \^, \$. - Case sensitivity: Specify whether case (upper case or lower case) matters. - Global v.s. Local: Mostly used for replacing: replace the first match (local) or all the matches (global)? - Precedence: which combination of the patterns match first? e.g., a|b* - is this (a|b)* or a|(b*) ? The precedence table of most common regex grouping patterns is: ----------------------------------------------------- Name | Representation ----------------------------------------------------- 1) Parentheses | ( ) 2) Multipliers | ? + * {m,n} {m, n}? 3) Sequence and anchoring | abc ^ $ 4) Alternation | | -----------------------------------------------------

4. Limitations of regular expression

- Some patterns cannot be matched by regular expressions. One example involves counting the number of occurences of a pattern. E.g. It can't match a pattern anbn, where a and b are both repeated n times. This is because regular expression can't keep track of the number of occurences. Similarly, anbncn ... also cannot be matched. - When there are lots of data to process, improper use of the regular expression pattern may lead to low efficiency of the program.

5. Example use in VBA

See Demo files COF.RegExp.xls and RegExp_Sept_1_2004.xls. The latter also comes with a documentation: RegExp_Sept_1_2004.doc.

References:

  1. Mastering Regular Expressions. By Jeffrey E. F. Friedl. 1st Ed. 1997. ISBN 10: 1-56592-257-3
  2. http://en.wikipedia.org/wiki/Regular_expression
  3. Excel VBA Regular Expression example
  4. Perl regular expression
  5. www.google.com

XC. Created on: 7/1/2008, Last modified: 7/14/2008