Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 3: Regular Expressions
August 29, 2007
Regular expressions
Regular expressions are symbols that stand for other symbols or strings of symbols. The set used by TextSTAT is the Python "re" set, which is the same (as far as I can tell) as the Perl set. Some of the most important components are listed in the table at the end of this handout (others at: http://docs.python.org/lib/re-syntax.html).
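As a minimal sketch of what "symbols that stand for other symbols" means, the snippet below calls Python's standard `re` module directly, outside TextSTAT. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, not taken from any course corpus.
text = "The cat sat on the mat; a bat flew past."

# The single pattern "[cbm]at" stands for three strings: "cat", "bat", "mat".
pattern = re.compile(r"[cbm]at")

# findall returns every non-overlapping match, left to right.
matches = pattern.findall(text)
print(matches)  # ['cat', 'mat', 'bat']
```

Note that "sat" is not matched, because "s" is not one of the characters listed in the set.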
Defining some patterns
1. All forms of the verb "live"
2. All forms of the verb "be"
3. The words "so" or "such", followed by up to five words, followed by "that"
4. Any word with the prefix "un" and the suffix "able"
5. Any word with any allomorph of the prefix "in" and any allomorph of the suffix "able"
6. Any word that has two sequences of repeated letters in a row.
7. Any word that has two or more sequences of two or more letters repeated, in a row.
8. "a" or "the", followed by any number of words, followed by any variant of the word "lion"
9. Any word that is in all capital letters
10. Any word that has a number in it in any position
11. Any word that starts with a sibilant, and ends with a vowel if the penultimate letter is "r", or ends with a consonant if the penultimate letter is "t"
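A few of the patterns above can be sketched with Python's `re` module. These are one possible reading of items 1, 3, 4, and 9, not official answers. The helper `find_all` and the sample sentences are hypothetical, and the patterns are case-sensitive as written.

```python
import re

def find_all(pattern, text):
    """Return the full text of every match of pattern in text (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Item 1: forms of "live" (live, lives, lived, living).
# \b keeps "olives" from matching.
live_forms = find_all(
    r"\bliv(e|es|ed|ing)\b",
    "She lives where he lived; living there, they live on olives.",
)

# Item 3: "so" or "such", then up to five words, then "that".
so_that = find_all(
    r"\b(so|such)(\s+\w+){0,5}\s+that\b",
    "It was so very cold that we left.",
)

# Item 4: prefix "un" and suffix "able", with at least one letter between them.
# "unable" is skipped on this reading, since "able" is its root, not a suffix.
un_able = find_all(
    r"\bun\w+able\b",
    "an unbelievable, unthinkable, but stable and unable result",
)

# Item 9: words written entirely in capital letters.
all_caps = find_all(r"\b[A-Z]+\b", "NASA hired Bob at the FBI")

print(live_forms)  # ['lives', 'lived', 'living', 'live']
print(so_that)     # ['so very cold that']
print(un_able)     # ['unbelievable', 'unthinkable']
print(all_caps)    # ['NASA', 'FBI']
```

The helper uses `re.finditer` and `group(0)` rather than `re.findall`, because `findall` returns only the parenthesized groups when a pattern contains them.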
For a more complete introduction to regular expressions in Python, check this out: http://www.amk.ca/python/howto/regex/
Regular expressions in TextSTAT and Python
| . | any character except a newline |
| ^ | the start of the string |
| $ | the end of the string or just before the newline at the end of the string. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. |
| * | causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's. |
| + | causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'. |
| ? | causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'. |
| *?, +?, ?? | the "*", "+", and "?" qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'. |
| {m} | specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six "a" characters, but not five. |
| {m,n} | causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 "a" characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand "a" characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form. |
| \ | either escapes special characters (permitting you to match characters like "*", "?", and so forth), or signals a special sequence; special sequences are discussed below. |
| [ ] | used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". Special characters are not active inside sets. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or digit. You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the set; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5", and [^^] will match any character except "^". |
| | | A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the "|" in this way. |
| (?=...) | matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'. |
| (?!...) | matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'. |
| \A | matches only at the start of the string. |
| \b | matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. |
| \s | matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v]. |
| \w | matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. |
| \Z | matches only at the end of the string. |
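Several of the table's own examples can be checked directly in Python. The sketch below assumes the standard `re` module and reuses the strings from the table; the variable names and the `\b` test string are invented for illustration.

```python
import re

# Greedy vs. non-greedy, using the '<H1>title</H1>' example.
html = "<H1>title</H1>"
greedy = re.search(r"<.*>", html).group(0)   # '<H1>title</H1>'
lazy = re.search(r"<.*?>", html).group(0)    # '<H1>'

# {m,n}: a{4,}b needs at least four "a" characters before the "b".
four_a = re.search(r"a{4,}b", "aaaab") is not None   # True
three_a = re.search(r"a{4,}b", "aaab") is not None   # False

# $ anchors the match to the end of the string.
foo_plain = re.search(r"foo", "foobar") is not None  # True
foo_end = re.search(r"foo$", "foobar") is not None   # False

# Lookahead assertions test what follows without consuming it.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group(0)  # 'Isaac '
negative = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")           # None
other = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group(0)     # 'Isaac '

# Complemented set: every character except "5".
not_five = re.findall(r"[^5]", "1525")  # ['1', '2']

# \b treats the underscore as a word character, so "cat_" does not match.
whole_word = re.findall(r"\bcat\b", "cat concat cats cat_")  # ['cat']

print(greedy, lazy, positive == other, not_five, whole_word)
```

The positive lookahead match is 'Isaac ' with its trailing space and without 'Asimov', which shows that the assertion checks the following text but leaves it out of the match.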