Skip to main content

Discrete Mathematics for Computer Science: An open educational resource.

Section 17.1 Formal Language Theory

Subsection 17.1.1 Languages

In formal language theory, an alphabet is a finite, non-empty set. The elements of the set are called symbols. A finite sequence of symbols \(a_1a_2\ldots a_n\) from an alphabet is called a string over that alphabet.

Definition 17.1.1. Alphabet.

An alphabet is a finite, non-empty set. It is customary to use \(\Sigma\) to denote an alphabet.

Definition 17.1.2. Symbol.

The elements of an alphabet are called symbols.

Definition 17.1.3. String.

A finite sequence of symbols \(a_1 a_2 \ldots a_n\) from an alphabet is called a string over that alphabet.

Example 17.1.4.

\(\Sigma = \{0,1\}\)\(011\text{,}\)\(1010\text{,}\)\(1\)\(\Sigma\text{.}\)
Note that strings really are sequences of symbols, which implies that order matters. Thus \(011\text{,}\) \(101\text{,}\) and \(110\) are all different strings, though they are made up of the same symbols. The strings \(x=\aetc{a}{n}\) and \(y=\aetc{b}{m}\) are equal only if \(m=n\) (i.e.the strings contain the same number of symbols) and \(a_i=b_i\) for all \(1 \leq i \leq n\text{.}\)
Just as there are operations defined on numbers, truth values, sets, and other mathematical entities, there are operations defined on strings. Some important operations are:

Definition 17.1.5. Length.

The length of a string \(x\) is the number of symbols in it. The notation for the length of \(x\) is \(|x|\text{.}\) Note that this is consistent with other uses of \(|\ |\text{,}\) all of which involve some notion of size: \(|number|\) measures how big a number is (in terms of its distance from 0); \(|set|\) measures the size of a set (in terms of the number of elements).
We will occasionally refer to a length-n string. This is a slightly awkward, but concise, shorthand for ``a string whose length is \(n\)".

Definition 17.1.6. Concatenation.

The concatenation of two strings \(x=a_1 a_2\ldots a_m\) and \(y=b_1b_2\ldots b_n\) is the sequence of symbols \(a_1\ldots a_mb_1\ldots b_n\text{.}\)
Sometimes \(\cdot\) is used to denote concatenation, but it is far more usual to see the concatenation of \(x\) and \(y\) denoted by \(xy\) than by \(x\cdot y\text{.}\) You can easily convince yourself that concatenation is associative (i.e. \((xy)z = x(yz)\) for all strings \(x,y\) and \(z\text{.}\)) Concatenation is not commutative (i.e. it is not always true that \(xy = yx\text{:}\) for example, if \(x=a\) and \(y=b\) then \(xy=ab\) while \(yx=ba\) and, as discussed above, these strings are not equal.)

Definition 17.1.7. Reversal.

The reverse of a string \(x=a_1a_2\ldots a_n\) is the string \(x^R = a_na_{n-1}\ldots a_2a_1\text{.}\)

Example 17.1.8. String Operations.

Let \(\Sigma = \{a,b\}\) , \(x=a\text{,}\) \(y=abaa\text{,}\) and \(z=bab\text{.}\)
Then \(|x| = 1\text{,}\) \(|y| = 4\text{,}\) and \(|z|=3\text{.}\)
Also, \(xx = aa\text{,}\) \(xy = aabaa\text{,}\) \(xz = abab\text{,}\) and \(zx = baba\text{.}\)
Finally, \(x^R = a\text{,}\) \(y^R= aaba\text{,}\) and \(z^R=bab\text{.}\)
By the way, the previous example illustrates a naming convention standard throughout language theory texts: if a letter is intended to represent a single symbol in an alphabet, the convention is to use a letter from the beginning of the English alphabet \(( a, b, c, d )\text{;}\) if a letter is intended to represent a string, the convention is to use a letter from the end of the English alphabet \(( u, v, \text{etc})\text{.}\)
In set theory, we have a special symbol to designate the set that contains no elements. Similarly, language theory has a special symbol \(\varepsilon\) which is used to represent the empty string, the string with no symbols in it. (Some texts use the symbol \(\lambda\) instead.) It is worth noting that \(|\varep| = 0\text{,}\) that \(\varep^R = \varep\text{,}\) and that \(\varep \cdot x = x \cdot \varep = x\) for all strings \(x\text{.}\) (This last fact may appear a bit confusing. Remember that \(\varep\) is not a symbol in a string with length 1, but rather the name given to the string made up of 0 symbols. Pasting those 0 symbols onto the front or back of a string \(x\) still produces \(x\text{.}\))
The set of all strings over an alphabet \(\Sigma\) is denoted \(\Sigma^*\text{.}\) (In language theory, the symbol \(^*\) is typically used to denote ``zero or more’’, so \(\Sigma^*\) is the set of strings made up of zero or more symbols from \(\Sigma\text{.}\)) Note that while an alphabet \(\Sigma\) is by definition a finite set of symbols, and strings are by definition finite sequences of those symbols, the set \(\Sigma^*\) is always infinite. Why is this? Suppose \(\Sigma\) contains \(n\) elements. Then there is one string over \(\Sigma\) with 0 symbols, \(n\) strings with 1 symbol, \(n^2\) strings with 2 symbols (since there are \(n\) choices for the first symbol and \(n\) choices for the second), \(n^3\) strings with 3 symbols, etc.

Example 17.1.9. Set of all Strings over an Alphabet.

If \(\Sigma = \{1\}\text{,}\) then \(\Sigma^* = \{\varep, 1, 11, 111, \ldots\}\text{.}\)
If \(\Sigma = \ab\text{,}\) then \(\Sigma^* = \{ \varep, a, b, aa, ab, ba, bb, aaa, aab, \ldots\}\text{.}\)
Note that \(\Sigma^*\) is countably infinite: if we list the strings as in the preceding example (length-0 strings, length-1 strings in ``alphabetical" order, length-2 strings similarly ordered, etc) then any string over \(\Sigma\) will eventually appear. (In fact, if \(|\Sigma| = n \geq 2\) and \(x \in \Sigma^*\) has length \(k\text{,}\) then \(x\) will appear on the list within the first \(\frac{n^{k+1} - 1}{n-1}\) entries.)
We now come to the definition of a language in the formal language theoretical sense.

Definition 17.1.10. Languages over an Alphabet.

A language over an alphabet \(\Sigma\) is a subset of \(\Sigma^*\text{.}\) Thus, a language over \(\Sigma\) is an element of \({\cal P}(\Sigma^*)\text{,}\) the power set of \(\Sigma^*\text{.}\)
In other words, any set of strings (over alphabet \(\Sigma\)) constitutes a language (over alphabet \(\Sigma\)).

Example 17.1.11. Languages over a set.

Let \(\Sigma = \{0,1\}\text{.}\) Then the following are all languages over \(\Sigma\text{:}\)
\begin{align*} L_1 \amp = \{011, 1010,111\} \\ L_2 \amp = \{0, 10, 110, 1110, 11110, \ldots\} \\ L_3 \amp = \{x \in \Sigma^* | n_0(x) = n_1(x) \} \\ \amp\text{where the notation } n_0(x) \text{ is the number of } 0\text{'s in the string } x \\ \amp \text{and similarly for } n_1(x)\\ L_4 \amp = \{x| x \text{ is a multiple of } 5 \text{ in binary}\} \end{align*}
Note that languages can be either finite or infinite. Because \(\Sigma^*\) is infinite, it clearly has an infinite number of subsets, and so there are an infinite number of languages over \(\Sigma\text{.}\) But are there countably or uncountably many such languages?
This fact is an immediate consequence of the result, proved in a previous chapter, that the power set of a countably infinite set is uncountable. Since the elements of \({\cal P}(\Sigma)\) are exactly the languages over \(\Sigma\text{,}\) there are uncountably many such languages.
Languages are sets and therefore, as for any sets, it makes sense to talk about the union, intersection, and complement of languages. (When taking the complement of a language over an alphabet \(\Sigma\text{,}\) we always consider the univeral set to be \(\Sigma^*\text{,}\) the set of all strings over \(\Sigma\text{.}\)) Because languages are sets of strings, there are additional operations that can be defined on languages, operations that would be meaningless on more general sets. For example, the idea of concatenation can be extended from strings to languages.
For two sets of strings \(S\) and \(T\text{,}\) we define the concatenation of \(S\) and \(T\) (denoted \(S\cdot T\) or just \(ST\)) to be the set \(ST = \{ st | s \in S \land t\in T \}\text{.}\) For example, if \(S = \{ab, aab\}\) and \(T=\{\varep, 110, 1010\}\text{,}\) then \(ST = \{ab, ab110, ab1010, aab, aab110, aab1010\}\text{.}\) Note in particular that \(ab \in ST\text{,}\) because \(ab \in S\text{,}\) \(\varep \in T\text{,}\) and \(ab \cdot \varep = ab\text{.}\) Because concatenation of sets is defined in terms of the concatenation of the strings that the sets contain, concatenation of sets is associative and not commutative. (This can easily be verified.)
When a set \(S\) is concatenated with itself, the notation \(SS\) is usually scrapped in favour of \(S^2\text{;}\) if \(S^2\) is concatenated with \(S\text{,}\) we write \(S^3\) for the resulting set, etc. So \(S^2\) is the set of all strings formed by concatenating two (possibly different, possibly identical) strings from \(S\text{,}\) \(S^3\) is the set of strings formed by concatenating three strings from \(S\text{,}\) etc. Extending this notation, we take \(S^1\) to be the set of strings formed from one string in \(S\) (i.e \(S^1\) is \(S\) itself), and \(S^0\) to be the set of strings formed from zero strings in \(S\) (i.e.\(S^0 = \{\varep\}\)). If we take the union \(S^0 \cup S^1 \cup S^2 \cup \ldots\text{,}\) then the resulting set is the set of all strings formed by concatenating zero or more strings from \(S\text{,}\) and is denoted \(S^*\text{.}\) The set \(S^*\) is called the Kleene closure of \(S\text{,}\) and the \(^*\) operator is called the Kleene star operator.

Example 17.1.13. Kleene Closure of a set.

Let \(S = \{01, ba\}\text{.}\) Then
\begin{align*} S^0 \amp = \{\varep\}\\ S^1 \amp = \{01, ba\} \\ S^2 \amp = \{0101, 01ba, ba01, baba\}\\ S^3 \amp = \{010101, 0101ba, 01ba01, 01baba, ba0101, ba01ba, baba01, bababa\}\\ \text{ etc, so} \amp \\ S^* \amp =\{\varep,01,ba,0101,01ba,ba01,baba,010101,0101ba,\ldots\}. \end{align*}
Note that this is the second time we have seen the notation \(something^*\text{.}\) We have previously seen that for an alphabet \(\Sigma\text{,}\) \(\Sigma^*\) is defined to be the set of all strings over \(\Sigma\text{.}\) If you think of \(\Sigma\) as being a set of length-1 strings, and take its Kleene closure, the result is once again the set of all strings over \(\Sigma\text{,}\) and so the two notions of \(^*\) coincide.

Example 17.1.14. \(\Sigma^*\).

\(\Sigma = \ab\text{.}\)
\begin{align*} \Sigma^0 \amp = \{\varep\} \\ \Sigma^1 \amp = \ab \\ \Sigma^2 \amp = \{aa, ab, ba, bb\}\\ \Sigma^3 \amp = \{aaa, aab, aba, abb, baa, bab, bba, bbb\} \\ \text{etc, so} \amp \\ \Sigma^* \amp =\{\varep,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,\ldots\}. \end{align*}

Exercises Exercises

1.
Let \(S = \{\varep, ab, abab\}\) and \(T = \{aa, aba, abba, abbba, \ldots\}\text{.}\) Find the following:
  1. \(\displaystyle S^2\)
  2. \(\displaystyle S^3\)
  3. \(\displaystyle S^*\)
  4. \(\displaystyle ST\)
  5. \(\displaystyle TS\)
Answer.
  1. \begin{equation*} S^2 = \{\varep, ab, abab, ababab, abababab\} \end{equation*}
  2. \begin{align*} S^3 = \amp \{\varep, ab, abab, ababab, abababab, \\ \amp ababababab, abababababab, ababababababab,\\ \amp abababababababab \} \end{align*}
  3. \begin{equation*} S^* = \{\varep, ab, abab, ababab, abababab, \ldots \} \end{equation*}
  4. \begin{align*} ST = \amp\{aa, aba, abba, abbba, \ldots, \\ \amp abaa, ababa, ababba, ababbba, \ldots,\\ \amp ababaa, abababa, abababba, abababbba, \ldots\} \end{align*}
  5. \begin{align*} TS = \amp \{aa, aba, abba, abbba, abbbba, \ldots, \\ \amp aaab, abaab, abbaab, abbbaab, \ldots ,\\ \amp aaabab, abaabab, abbaabab, abbbaabab, \ldots\} \end{align*}
2.
The reverse of a language \(L\) is defined to be \(L^R = \{ x^R | x \in L\}\text{.}\) Find \(S^R\) and \(T^R\) for the \(S\) and \(T\) in the preceding problem.
3.
Give an example of a language \(L\) such that \(L=L^*\text{.}\)
Answer.
\begin{equation*} L = \{\varep, a, aa, aaa, aaaa, \ldots\} \end{equation*}