Ling 431/631: Corpus Linguistics

Ben Bergen

 

Meeting 3: Regular Expressions

August 29, 2007

 

Regular expressions

 

Regular expressions are patterns: symbols that stand for other symbols or strings of symbols.

 

The set used by TextSTAT is the Python "re" set, which is largely the same (as far as I can tell) as the Perl set. Some of the most important components are on page 2 (others at: http://docs.python.org/lib/re-syntax.html).

 

Defining some patterns

 

1.        All forms of the verb "live"

 

2.        All forms of the verb "be"

 

3.        The words "so" or "such", followed by up to five words, followed by "that"

 

4.        Any word with the prefix "un" and the suffix "able"

 

5.        Any word with any allomorph of the prefix "in" and any allomorph of the suffix "able"

 

6.        Any word that has two sequences of repeated letters in a row.

 

7.        Any word that has two or more sequences of two or more letters repeated, in a row.

 

8.        "a" or "the", followed by any number of words, followed by any variant of the word "lion"

 

9.        Any word that is in all capital letters

 

10.    Any word that has a number in it in any position

 

11.    Any word that starts with a sibilant, and ends with a vowel if the penultimate letter is "r", or ends with a consonant if the penultimate letter is "t"
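Patterns like these can be tried out in Python as well as in TextSTAT. Below is a minimal sketch of one possible answer to pattern 1 only. The sample sentence is invented for illustration, and the pattern is one reasonable solution, not the only one.

```python
import re

# Hypothetical sample sentence, invented for this illustration.
text = "She lives here, they lived there, I am living well, and olives live on."

# Pattern 1: the forms live, lives, lived, living.
# \b stops the match from starting inside a longer word such as "olives".
# (?:...) groups the endings without capturing them; it is not in the table
# in this handout, but it is part of Python's "re" syntax.
live_pattern = r"\bliv(?:e|es|ed|ing)\b"

live_forms = re.findall(live_pattern, text)
print(live_forms)  # ['lives', 'lived', 'living', 'live']
```

The match is case-sensitive, so a sentence-initial "Live" would be missed unless the flag re.IGNORECASE is passed as a third argument to re.findall.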

 

For a more complete introduction to regular expressions in Python, check this out: http://www.amk.ca/python/howto/regex/


Regular expressions in TextSTAT and Python

 

.

any character except a newline

^

the start of the string

$

the end of the string or just before the newline at the end of the string. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'.
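The first three symbols can be checked in a few lines of Python. This is only a sketch; the test strings are invented, and re.search is used because it looks for the pattern anywhere in the string.

```python
import re

# Without an anchor, "foo" is found anywhere in the string.
unanchored = bool(re.search(r"foo", "foobar"))        # True

# "foo$" must end at the end of the string.
end_fails = bool(re.search(r"foo$", "foobar"))        # False
end_ok = bool(re.search(r"foo$", "foo"))              # True

# "$" also matches just before a newline at the very end of the string.
end_before_newline = bool(re.search(r"foo$", "foo\n"))  # True

# "^bar" fails because "bar" is not at the start of the string.
start_fails = bool(re.search(r"^bar", "foobar"))      # False

# "." matches any character except a newline.
dot_matches = re.findall(r"f.o", "foo fao f\no")      # ['foo', 'fao']

print(unanchored, end_fails, end_ok, end_before_newline, start_fails, dot_matches)
```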

*

causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

+

causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

?

causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
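The difference between *, + and ? is easiest to see by running the three example patterns over the same strings. A small sketch, assuming Python 3.4 or later for re.fullmatch (which requires the whole string to match); the list of candidate strings is invented:

```python
import re

candidates = ["a", "ab", "abbb"]

# ab*  : "a" followed by zero or more "b"s
star = [s for s in candidates if re.fullmatch(r"ab*", s)]

# ab+  : "a" followed by one or more "b"s
plus = [s for s in candidates if re.fullmatch(r"ab+", s)]

# ab?  : "a" followed by zero or one "b"
optional = [s for s in candidates if re.fullmatch(r"ab?", s)]

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab']
```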

*?, +?, ??

the "*", "+", and "?" quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
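The '<H1>title</H1>' example can be reproduced directly. A short sketch; the variable names are just for illustration:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* grabs as much as it can, so the match runs to the last ">".
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? grabs as little as it can, so the match stops at the first ">".
minimal = re.match(r"<.*?>", html).group()

print(greedy)   # <H1>title</H1>
print(minimal)  # <H1>
```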

{m}

specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six "a" characters, but not five.

{m,n}

causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 "a" characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand "a" characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.
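The counted repetitions {m} and {m,n} can be checked the same way. A sketch assuming Python 3.4 or later for re.fullmatch; the test strings follow the examples in the two entries above:

```python
import re

# a{6}: exactly six "a" characters.
six_ok = bool(re.fullmatch(r"a{6}", "aaaaaa"))    # True
five_fails = bool(re.fullmatch(r"a{6}", "aaaaa")) # False

# a{3,5} is greedy: given seven "a"s, the first match takes five of them.
three_to_five = re.search(r"a{3,5}", "aaaaaaa").group()  # 'aaaaa'

# a{4,}b: four or more "a"s, then "b".
four_ok = bool(re.fullmatch(r"a{4,}b", "aaaab"))            # True
thousand_ok = bool(re.fullmatch(r"a{4,}b", "a" * 1000 + "b"))  # True
three_fails = bool(re.fullmatch(r"a{4,}b", "aaab"))         # False

print(six_ok, five_fails, three_to_five, four_ok, thousand_ok, three_fails)
```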

\

either escapes special characters (permitting you to match characters like "*", "?", and so forth), or signals a special sequence; special sequences are discussed below.
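A quick illustration of why escaping matters, using an invented test string. In Python it is safest to write patterns as raw strings (r"..."), so that the backslash reaches the "re" module unchanged.

```python
import re

text = "3.14 3x14"

# Unescaped "." matches any character, so both tokens match.
unescaped = re.findall(r"3.14", text)

# Escaped "\." matches only a literal period.
escaped = re.findall(r"3\.14", text)

# "\?" matches a literal question mark.
question_marks = re.findall(r"\?", "Why? How?")

print(unescaped)       # ['3.14', '3x14']
print(escaped)         # ['3.14']
print(question_marks)  # ['?', '?']
```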

[ ]

used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". Special characters are not active inside sets. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or digit. You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the set; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5", and [^^] will match any character except "^".
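The set examples above, run against a few invented strings. This is only a sketch to show what each set picks out:

```python
import re

# [akm$]: any one of "a", "k", "m", or a literal "$".
listed = re.findall(r"[akm$]", "make $5")

# [a-zA-Z0-9]+: runs of letters and digits.
alnum_runs = re.findall(r"[a-zA-Z0-9]+", "Room 101, ok?")

# [^5]: any character except "5".
not_five = re.findall(r"[^5]", "2525")

# [^^]: any character except "^".
not_caret = re.findall(r"[^^]", "a^b")

print(listed)      # ['m', 'a', 'k', '$']
print(alnum_runs)  # ['Room', '101', 'ok']
print(not_five)    # ['2', '2']
print(not_caret)   # ['a', 'b']
```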

|

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the "|" in this way.
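A small sketch of alternation on an invented sentence. Note that "|" separates whole alternatives, so each side of it carries its own \b markers here.

```python
import re

text = "It was so cold and such a long day, also windy."

# Either the whole word "so" or the whole word "such".
# The \b on each side keeps "so" from matching inside "also".
so_or_such = re.findall(r"\bso\b|\bsuch\b", text)

# Without \b, the "so" inside "also" is matched too.
loose = re.findall(r"so|such", text)

print(so_or_such)  # ['so', 'such']
print(loose)       # ['so', 'such', 'so']
```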

(?=...)

matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

(?!...)

matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
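The two lookahead examples can be compared on one invented string containing both names. The sketch below reports where each pattern matches; in both cases the matched text is only 'Isaac ', because the lookahead itself consumes nothing.

```python
import re

text = "Isaac Asimov and Isaac Newton"

# Positive lookahead: "Isaac " only where "Asimov" comes next.
positive = re.search(r"Isaac (?=Asimov)", text)

# Negative lookahead: "Isaac " only where "Asimov" does not come next.
negative = re.search(r"Isaac (?!Asimov)", text)

print(repr(positive.group()), positive.start())  # 'Isaac ' 0
print(repr(negative.group()), negative.start())  # 'Isaac ' 17
```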

\A

matches only at the start of the string.

\b

matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
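A sketch of \b at work on an invented list of words. In Python the pattern should be a raw string, because in an ordinary string "\b" means the backspace character.

```python
import re

text = "unhappy fun under_score sun"

# \bun requires "un" to sit at the start of a word,
# so "fun" and "sun" are skipped.
un_initial = re.findall(r"\bun\w*", text)

# Without \b, the "un" inside "fun" and "sun" is found as well.
un_anywhere = re.findall(r"un\w*", text)

print(un_initial)   # ['unhappy', 'under_score']
print(un_anywhere)  # ['unhappy', 'un', 'under_score', 'un']
```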

\s

matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].

\w

matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].
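A short sketch of \s and \w on invented strings. One caveat, assuming Python 3: for ordinary text strings \w also matches non-ASCII letters such as "é", so the equivalence with [a-zA-Z0-9_] holds only when the re.ASCII flag is given.

```python
import re

# \s+ splits on any run of whitespace (spaces, tabs, newlines).
tokens = re.split(r"\s+", "one  two\tthree\nfour")

# \w+ picks out runs of letters, digits, and underscores;
# the apostrophe is not a word character, so "it's" is split.
words = re.findall(r"\w+", "it's snake_case, ok 42")

# Python 3 default: \w covers Unicode letters.
unicode_word = re.findall(r"\w+", "café")

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_word = re.findall(r"\w+", "café", re.ASCII)

print(tokens)        # ['one', 'two', 'three', 'four']
print(words)         # ['it', 's', 'snake_case', 'ok', '42']
print(unicode_word)  # ['café']
print(ascii_word)    # ['caf']
```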

\Z

matches only at the end of the string.
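By default \A and \Z behave much like ^ and $. The difference shows up when the re.MULTILINE flag is passed to Python's "re" functions, which makes ^ and $ also match at the start and end of every line. A sketch on an invented two-line string:

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ matches at the start of every line...
caret_starts = re.findall(r"^\w+", text, re.MULTILINE)

# ...but \A still matches only at the start of the whole string.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

# Likewise, $ matches at the end of every line...
dollar_ends = re.findall(r"\w+$", text, re.MULTILINE)

# ...but \Z matches only at the end of the whole string.
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)

print(caret_starts)  # ['first', 'second']
print(string_start)  # ['first']
print(dollar_ends)   # ['line', 'line']
print(string_end)    # ['line']
```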