Regular Expressions and Their Applications

Section 17.2 Regular Expressions and Their Applications

Subsection 17.2.1 Regular Expressions

Though we have used the term string throughout to refer to a sequence of symbols from an alphabet, an alternative term that is frequently used is word. The analogy seems fairly obvious: strings are made up of ``letters" from an alphabet, just as words are in human languages like English. In English, however, there are no particular rules specifying which sequences of letters can be used to form legal English words---even unlikely combinations like ghth and ckstr have their place. While some formal languages may simply be random collections of arbitrary strings, more interesting languages are those where the strings in the language all share some common structure:

\begin{gather*} L_1 = \{ x\in \ab^* \ | n_a(x) = n_b(x)\};\\ L_2 = \{\text{legal Java identifiers}\};\\ L_3 = \{\text{legal C++ programs}\}. \end{gather*}

In all of these languages, there are structural rules which determine which sequences of symbols are in the language and which aren’t. So despite the terminology of ``alphabet" and ``word" in formal language theory, the concepts don’t necessarily match ``alphabet" and ``word" for human languages. A better parallel is to think of the alphabet in a formal language as corresponding to the words in a human language; the words in a formal language correspond to the sentences in a human language, as there are rules (grammar rules) which determine how they can legally be constructed.

One way of describing the grammatical structure of the strings in a language is to use a mathematical formalism called a regular expression. A regular expression is a pattern that ``matches" strings that have a particular form. For example, consider the language (over alphabet $\Sigma = \ab$) $L= \{x \ | \ x \text{ starts and ends with } a\}\text{.}$ What is the symbol-by-symbol structure of strings in this language? Well, they start with an $a\text{,}$ followed by zero or more $a$’s or $b$’s or both, followed by an $a\text{.}$ The regular expression $a \cdot (a \REOR b)^* \cdot a$ is a pattern that captures this structure and matches any string in $L$ ($\cdot$ and $^*$ have their usual meanings, and $\REOR$ designates or.) Conversely, consider the regular expression $(a\cdot(a\REOR b)^*) \REOR ((a\REOR b)^*\cdot a)\text{.}$ This is a pattern that matches any string that either has the form ``$a$ followed by zero or more $a$’s or $b$’s or both" (i.e. any string that starts with an $a$) or has the form ``zero or more $a$’s or $b$’s or both followed by an $a$" (i.e. any string that ends with an $a$). Thus the regular expression generates the language of all strings that start or end (or both) in an $a\text{:}$ this is the set of strings that match the regular expression.

Here are the formal definitions of a regular expression and the language generated by a regular expression:

Definition 17.2.1. Regular Expression.

Let $\Sigma$ be an alphabet. Then the following patterns are regular expressions over $\Sigma\text{:}$

$\Phi$ and $\varep$ are regular expressions;
$a$ is a regular expression, for each $a \in \Sigma\text{;}$
if $r_1$ and $r_2$ are regular expressions, then so are $r_1\REOR r_2$ , $r_1\cdot r_2\text{,}$ $r_1^*$ and $(r_1)$ (and of course, $r_2^*$ and $(r_2)$). As in concatenation of strings, the $\cdot$ is often left out of the second expression. (Note: the order of precedence of operators, from lowest to highest, is $\REOR$ , $\cdot\text{,}$ $*\text{.}$)

No other patterns are regular expressions.

Definition 17.2.2. Language Generated by a Regular Expression.

The language generated by a regular expression $r\text{,}$ denoted $L(r)\text{,}$ is defined as follows:

$L(\Phi) = \emptyset,$ i.e. no strings match $\Phi\text{;}$
$L(\varep) = \{\varep\}\text{,}$ i.e. $\varep$ matches only the empty string;
$L(a) = \{a\}\text{,}$ i.e. $a$ matches only the string $a\text{;}$
$L(r_1\REOR r_2) = L(r_1) \cup L(r_2)\text{,}$ i.e. $r_1\REOR r_2$ matches strings that match $r_1$ or $r_2$ or both;
$L(r_1r_2) = L(r_1)L(r_2)\text{,}$ i.e. $r_1r_2$ matches strings of the form ``something that matches $r_1$ followed by something that matches $r_2$";
$L(r_1^*) = (L(r_1))^*\text{,}$ i.e. $r_1^*$ matches sequences of 0 or more strings each of which matches $r_1\text{.}$
$L((r_1)) = L(r_1)\text{,}$ i.e. $(r_1)$ matches exactly those strings matched by $r_1\text{.}$

Example 17.2.3.

$\Sigma = \ab\text{,}$$r=a^*b^*\text{.}$$L(r)\text{?}$$L(a) = \{a\}$$L(a^*) = (L(a))^* = \{a\}^*\text{,}$$\{a\}^*$$a$$L(a^*) = \{\varep, a, aa, aaa, \ldots\}\text{.}$$L(b^*) = \{\varep, b, bb, bbb, \ldots\}\text{.}$$L(a^*b^*) = L(a^*)L(b^*) = \{xy \ | \ x\in L(a^*)\land y\in L(b^*)\}\text{,}$$L(a^*b^*) = \{\varep, a, b, aa, ab, bb, aaa, aab, abb, bbb, \ldots\}\text{,}$$a$$b$

Example 17.2.4.

$\Sigma = \ab\text{,}$$r=(a\REOR aa\REOR aaa)(bb)^*\text{.}$$L(a) = \{a\}\text{,}$$L(aa) = L(a)L(a) = \{aa\}\text{.}$$L(aaa) = \{aaa\}$$L(bb) = \{bb\}\text{.}$$L(a\REOR aa\REOR aaa) = L(a) \cup L(aa) \cup L(aaa) = \{a, aa, aaa\},$$L((bb)^*) = (L((bb)))^* = (L(bb))^*$Definition 17.2.2$(L(bb))^* = \{bb\}^* = \{\varep, bb, bbbb, \ldots\}\text{.}$$L(r)$$a$$aa$$aaa$$b$

Definition 17.2.5. Regular Language.

A language is regular if it is generated by a regular expression.

Clearly the union of two regular languages is regular; likewise, the concatenation of regular languages is regular; and the Kleene closure of a regular language is regular. It is less clear whether the intersection of regular languages is always regular; nor is it clear whether the complement of a regular language is guaranteed to be regular. These are questions that will be taken up in Section 17.4.

Regular languages, then, are languages whose strings’ structure can be described in a very formal, mathematical way. The fact that a language can be ``mechanically" described or generated means that we are likely to be able to get a computer to recognize strings in that language. We will pursue the question of mechanical language recognition in Section 17.3, and subsequently will see that our first attempt to model mechanical language recognition does in fact produce a family of ``machines" that recognize exactly the regular languages. But first, in the next section, we will look at some practical applications of regular expressions.

Subsection 17.2.2 Application: Using Regular Expressions

A common operation when editing text is to search for a given string of characters, sometimes with the purpose of replacing it with another string. Many ``search and replace’’ facilities have the option of using regular expressions instead of simple strings of characters. A regular expression describes a language, that is, a set of strings. We can think of a regular expression as a pattern that matches certain strings, namely all the strings in the language described by the regular expression. When a regular expression is used in a search operation, the goal is to find a string that matches the expression. This type of pattern matching is very useful.

The ability to do pattern matching with regular expressions is provided in many text editors. Programming languages often come with libraries for working with regular expressions. Java (as of version 1.4) provides regular expression handling though a package named java.util.regexp. C++ typically provides a header file named regexp.h for the same purpose. In all these applications, many new notations are added to the syntax to make it more convenient to use. The syntax can vary from one implementation to another, but most implementations include the capabilities discussed in this section.

In applications of regular expressions, the alphabet usually includes all the characters on the keyboard. This leads to a problem, because regular expressions actually use two types of symbols: symbols that are members of the alphabet and special symbols such a ``*’’ and ``)’’ that are used to construct expressions. These special symbols, which are not part of the language being described but are used in the description, are called meta-characters. The problem is, when the alphabet includes all the available characters, what do we do about meta-characters? If the language that we are describing uses the ``*’’ character, for example, how can we represent the Kleene star operation?

The solution is to use a so-called ``escape character,’’ which is usually the backslash, \. We agree, for example, that the notation \* refers to the symbol * that is a member of the alphabet, while * by itself is the meta-character that represents the Kleene star operation. Similarly, ( and ) are the meta-characters that are used for grouping, while the corresponding characters in the language are written as $ and $. For example, a regular expression that matches the string a*b repeated any number of times would be written: (a\*b)*. The backslash is also used to represent certain non-printing characters. For example, a tab is represented as \t and a new line character is \n.

We introduce two new common operations on regular expressions and two new meta-characters to represent them. The first operation is represented by the meta-character +: If r is a regular expression, then r+ represents the occurrence of r one or more times. The second operation is represented by ?: The notation r? represents an occurrence of r zero or one times. That is to say, r? represents an optional occurrence of r. Note that these operations are introduced for convenience only and do not represent any real increase in the power. In fact, r+ is exactly equivalent to rr*, and r? is equivalent to (r|$\varep$) (except that in applications there is generally no equivalent to $\varep$).

To make it easier to deal with the large number of characters in the alphabet, character classes are introduced. A character class consists of a list of characters enclosed between brackets, [ and ]. (The brackets are meta-characters.) A character class matches a single character, which can be any of the characters in the list. For example, [0123456789] matches any one of the digits 0 through 9. The same thing could be expressed as (0|1|2|3|4|5|6|7|8|9), so once again we have added only convenience, not new representational power. For even more convenience, a hyphen can be included in a character class to indicate a range of characters. This means that [0123456789] could also be written as [0-9] and that the regular expression [a-z] will match any single lowercase letter. A character class can include multiple ranges, so that [a-zA-Z] will match any letter, lower- or uppercase. The period (.) is a meta-character that will match any single character, except (in most implementations) for an end-of-line. These notations can, of course, be used in more complex regular expressions. For example, [A-Z][a-zA-Z]* will match any capitalized word, and $.*$ matches any string of characters enclosed in parentheses.

In most implementations, the meta-character ^ can be used in a regular expression to match the beginning of a line of text, so that the expression ^[a-zA-Z]+ will only match a word that occurs at the start of a line. Similarly, $ is used as a meta-character to match the end of a line. Some implementations also have a way of matching beginnings and ends of words. Typically, \b will match such ``word boundaries.’’ Using this notation, the pattern \band\b will match the string ``and’’ when it occurs as a word, but will not match the a-n-d in the word ``random.’’ We are going a bit beyond basic regular expressions here: Previously, we only thought of a regular expression as something that either will match or will not match a given string in its entirety. When we use a regular expression for a search operation, however, we want to find a substring of a given string that matches the expression. The notations ^, \$ and \ b put a restrictions on where the matching substring can be located in the string.

When regular expressions are used in search-and-replace operations, a regular expression is used for the search pattern. A search is made in a (typically long) string for a substring that matches the pattern, and then the substring is replaced by a specified replacement pattern. The replacement pattern is not used for matching and is not a regular expression. However, it can be more than just a simple string. It’s possible to include parts of the substring that is being replaced in the replacement string. The notations \0,\1, ..., \9 are used for this purpose. The first of these, \0, stands for the entire substring that is being replaced. The others are only available when parentheses are used in the search pattern. The notation \1 stands for ``the part of the substring that matched the part of the search pattern beginning with the first ( in the pattern and ending with the matching ).’’ Similarly, \2 represents whatever matched the part of the search pattern between the second pair of parentheses, and so on.

Suppose, for example, that you would like to search for a name in the form last-name, first-name and replace it with the same name in the form first-name last-name. For example, ``Reeves, Keanu’’ should be converted to ``Keanu Reeves’’. Assuming that names contain only letters, this could be done using the search pattern ([A-Za-z]+), ([A-Za-z]+) and the replacement pattern \2 \1. When the match is made, the first ([A-Za-z]+) will match ``Reeves,’’ so that in the replacement pattern, \1 represents the substring ``Reeves’’. Similarly, \2 will represent ``Keanu’’. Note that the parentheses are included in the search pattern only to specify what parts of the string are represented by \1 and \2. In practice, you might use ^([A-Za-z]+), ([A-Za-z])\$ as the search pattern to constrain it so that it will only match a complete line of text. By using a ``global’’ search-and-replace, you could convert an entire file of names from one format to the other in a single operation.

Regular expressions are a powerful and useful technique that should be part of any computer scientist’s toolbox. This section has given you a taste of what they can do, but you should check out the specific capabilities of the regular expression implementation in the tools and programming languages that you use.

Exercises 17.2.3 Exercises

1.

Give English-language descriptions of the languages generated by the following regular expressions.

$\displaystyle (a\REOR b)^*$
$\displaystyle a^*\REOR b^*$
$\displaystyle b^*(ab^*ab^*)^*$
$\displaystyle b^*(abb^*)$

2.

Give regular expressions over $\Sigma=\ab$ that generate the following languages.

$\displaystyle L_1 = \{ x \ | \ x \text{ contains 3 consecutive $a$'s}\}$
$\displaystyle L_2 = \{ x \ | \ x \text{ has even length}\}$
$\displaystyle L_3 = \{ x \ | \ n_b(x) = 2 \bmod{3}\}$
$\displaystyle L_4 = \{ x \ | \ x \text{ contains the substring } aaba\}$
$\displaystyle L_5 = \{ x \ | \ n_b(x) \lt 2 \}$
$\displaystyle L_6 = \{ x \ | \ x \text{ doesn't end in } aa\}$

3.

Prove that all finite languages are regular.

4.

The backslash is itself a meta-character. Suppose that you want to match a string that contains a backslash character. How do you suppose you would represent the backslash in the regular expression?

5.

Using the notation introduced in this section, write a regular expression that could be used to match each of the following:

Any sequence of letters (upper- or lowercase) that includes the letter Z (in uppercase).
Any eleven-digit telephone number written in the form(xxx)xxx-xxxx.
Any eleven-digit telephone number either in the form (xxx)xxx-xxxx or xxx-xxx-xxxx.
A non-negative real number with an optional decimal part. The expression should match numbers such as 17, 183.9999, 182., 0, 0.001, and 21333.2.
A complete line of text that contains only letters.
A C++ style one-line comment consisting of // and all the following characters up to the end-of-line.

6.

Give a search pattern and a replace pattern that could be used to perform the following conversions:

Convert a string that is enclosed in a pair of double quotes to the same string with the double quotes replaced by single quotes.
Convert seven-digit telephone numbers in the format xxx-xxx-xxxx to the format (xxx)xxx-xxxx.
Convert C++ one-line comments, consisting of characters between // and end-of-line, to C style comments enclosed between /* and */.
Convert any number of consecutive spaces and tabs to a single space.

7.

In some implementations of ``regular expressions,’’ the notations \1, \2, and so on can occur in a search pattern. For example, consider the search pattern ^([a-zA-Z]).*\1$. Here, \1 represents a recurrence of the same substring that matched [a-zA-Z], the part of the pattern between the first pair of parentheses. The entire pattern, therefore, will match a line of text that begins and ends with the same letter. Using this notation, write a pattern that matches all strings in the language $L=\{a^nba^n | n\ge0\}\text{.}$ (Later in this chapter, we will see that $L$ is not a regular language, so allowing the use of \1 in a ``regular expression’’ means that it’s not really a regular expression at all! This notation can add a real increase in expressive power to the patterns that contain it.)

Prev Top Next