Both natural languages, such as English, and the artificial languages used for programming have a structure known as grammar or syntax. In order to form legal sentences or programs, the parts of the language must be fit together according to certain rules. For natural languages, the rules are somewhat informal (although high-school English teachers might have us believe differently). For programming languages, the rules are absolute, and programs that violate the rules will be rejected by a compiler.
In this section, we will study formal grammars and languages defined by them. The languages we look at will, for the most part, be “toy” languages, compared to natural languages or even to programming languages, but the ideas and techniques are basic to any study of language. In fact, many of the ideas arose almost simultaneously in the 1950s in the work of linguists who were studying natural language and programmers who were looking for ways to specify the syntax of programming languages.
The grammars in this section are generative grammars. A generative grammar is a set of rules that can be used to generate all the legal strings in a language. We will also consider the closely related idea of parsing. To parse a string means to determine how that string can be generated according to the rules.
This section is a continuation of the preceding section. Like a regular expression, a grammar is a way to specify a possibly infinite language with a finite amount of information. In fact, we will see that every regular language can be specified by a certain simple type of grammar. We will also see that some languages that can be specified by grammars are not regular.
Subsection 17.5.1 Context-free Grammars
In its most general form, a grammar is a set of rewriting rules. A rewriting rule specifies that a certain string of symbols can be substituted for all or part of another string. If \(w\) and \(u\) are strings, then \(w\PRODUCES u\) is a rewriting rule that specifies that the string \(w\) can be replaced by the string \(u\text{.}\) The symbol “\(\PRODUCES\)” is read “can be rewritten as.” Rewriting rules are also called production rules or productions, and “\(\PRODUCES\)” can also be read as “produces.” For example, if we consider strings over the alphabet \(\{a,b,c\}\text{,}\) then the production rule \(aba\PRODUCES cc\) can be applied to the string \(abbabac\) to give the string \(abbccc\text{.}\) The substring \(aba\) in the string \(abbabac\) has been replaced with \(cc\text{.}\)
In a context-free grammar, every rewriting rule has the form \(A\PRODUCES w\text{,}\) where \(A\) is a single symbol and \(w\) is a string of zero or more symbols. The symbols that occur on the left-hand sides of production rules in a context-free grammar are called non-terminal symbols. By convention, the non-terminal symbols are usually uppercase letters. The strings on the right-hand sides of the production rules can include non-terminal symbols as well as other symbols, which are called terminal symbols. By convention, the terminal symbols are usually lowercase letters. Here are some typical production rules that might occur in context-free grammars:
\begin{align*}
A\amp \PRODUCES aAbB\\
S \amp \PRODUCES SS\\
C\amp \PRODUCES Acc\\
B\amp \PRODUCES b\\
A \amp \PRODUCES\EMPTYSTRING
\end{align*}
In the last rule in this list, \(\EMPTYSTRING\) represents the empty string, as usual. For example, this rule could be applied to the string \(aBaAcA\) to produce the string \(aBacA\text{.}\) The first occurrence of the symbol \(A\) in \(aBaAcA\) has been replaced by the empty string---which is just another way of saying that the symbol has been dropped from the string.
In every context-free grammar, one of the non-terminal symbols is designated as the start symbol of the grammar. The start symbol is often, though not always, denoted by \(S\text{.}\) When the grammar is used to generate strings in a language, the idea is to start with a string consisting of nothing but the start symbol. Then a sequence of production rules is applied. Each application of a production rule to the string transforms the string to a new string. If and when this process produces a string that consists purely of terminal symbols, the process ends. That string of terminal symbols is one of the strings in the language generated by the grammar. In fact, the language consists precisely of all strings of terminal symbols that can be produced in this way.
As a simple example, consider a grammar that has three production rules:
\begin{gather*}
S\PRODUCES aS\\
S\PRODUCES bS \\
S\PRODUCES b
\end{gather*}
In this example, \(S\) is the only non-terminal symbol, and the terminal symbols are \(a\) and \(b\text{.}\) Starting from the string \(S\text{,}\) we can apply any of the three rules of the grammar to produce either \(aS\text{,}\) \(bS\text{,}\) or \(b\text{.}\) Since the string \(b\) contains no non-terminals, we see that \(b\) is one of the strings in the language generated by this grammar. The strings \(aS\) and \(bS\) are not in that language, since they contain the non-terminal symbol \(S\text{,}\) but we can continue to apply production rules to these strings. From \(aS\text{,}\) for example, we can obtain \(aaS\text{,}\) \(abS\text{,}\) or \(ab\text{.}\) From \(abS\text{,}\) we go on to obtain \(abaS\text{,}\) \(abbS\text{,}\) or \(abb\text{.}\) The strings \(ab\) and \(abb\) are in the language generated by the grammar. It’s not hard to see that any string of \(a\text{'s}\) and \(b\text{'s}\) that ends with a \(b\) can be generated by this grammar, and that these are the only strings that can be generated. That is, the language generated by this grammar is the regular language specified by the regular expression \((a\REOR b)^{*}b\text{.}\)
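The generation process just described is easy to carry out by computer. Here is a short Python sketch (the code and names are my own) that collects every string of length at most four that can be derived from \(S\) and checks that they are exactly the strings of those lengths matched by the regular expression \((a\REOR b)^*b\text{:}\)

import itertools
import re

# The grammar S -> aS, S -> bS, S -> b, written as a map from the
# non-terminal symbol to the list of right-hand sides of its rules.
rules = {"S": ["aS", "bS", "b"]}

def generate(max_len):
    """Collect every terminal string of length <= max_len that can be
    derived from the start symbol S, rewriting the leftmost
    non-terminal at each step."""
    results, seen, frontier = set(), set(), ["S"]
    while frontier:
        next_frontier = []
        for s in frontier:
            i = next((k for k, c in enumerate(s) if c.isupper()), None)
            if i is None:            # no non-terminals left: s is in L(G)
                results.add(s)
                continue
            for rhs in rules[s[i]]:
                t = s[:i] + rhs + s[i + 1:]
                # prune once the terminal part is already too long
                if len(t.replace("S", "")) <= max_len and t not in seen:
                    seen.add(t)
                    next_frontier.append(t)
        frontier = next_frontier
    return results

generated = generate(4)
for n in range(1, 5):
    regular = {"".join(p) for p in itertools.product("ab", repeat=n)
               if re.fullmatch("(a|b)*b", "".join(p))}
    assert {w for w in generated if len(w) == n} == regular
print(sorted(generated, key=len))    # shortest strings first: b, ab, bb, ...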
It’s time to give some formal definitions of the concepts which we have been discussing.
Definition 17.5.1. Context-free Grammar.
A context-free grammar is a 4-tuple \((V,\Sigma,P,S)\text{,}\) where:
\(V\) is a finite set of symbols. The elements of \(V\) are the non-terminal symbols of the grammar.
\(\Sigma\) is a finite set of symbols such that \(V\cap\Sigma=\emptyset\text{.}\) The elements of \(\Sigma\) are the terminal symbols of the grammar.
\(P\) is a set of production rules. Each rule is of the form \(A\PRODUCES w\) where \(A\) is one of the symbols in \(V\) and \(w\) is a string in the language \((V\cup\Sigma)^*\text{.}\)
\(S\in V\text{.}\)\(S\) is the start symbol of the grammar.
Even though this is the formal definition, grammars are often specified informally simply by listing the set of production rules. When this is done it is assumed, unless otherwise specified, that the non-terminal symbols are just the symbols that occur on the left-hand sides of production rules of the grammar. The terminal symbols are all the other symbols that occur on the right-hand sides of production rules. The start symbol is the symbol that occurs on the left-hand side of the first production rule in the list. Thus, the list of production rules
\begin{align*}
T\amp\PRODUCES TT \\
T \amp \PRODUCES A\\
A \amp \PRODUCES aAa \\
A \amp \PRODUCES bB \\
B \amp \PRODUCES bB \\
B \amp \PRODUCES \EMPTYSTRING
\end{align*}
specifies a grammar \(G=(V,\Sigma,P,T)\) where \(V\) is \(\{T,A,B\}\text{,}\)\(\Sigma\) is \(\{a,b\}\text{,}\) and \(T\) is the start symbol. \(P\text{,}\) of course, is a set containing the six production rules in the list.
Let \(G=(V,\Sigma,P,S)\) be a context-free grammar. Suppose that \(x\) and \(y\) are strings in the language \((V\cup\Sigma)^*\text{.}\) The notation \(x\YIELDS_G y\) is used to express the fact that \(y\) can be obtained from \(x\) by applying one of the production rules in \(P\text{.}\) To be more exact, we say that \(x\YIELDS_G y\) if and only if there is a production rule \(A\PRODUCES w\) in the grammar and two strings \(u\) and \(v\) in the language \((V\cup\Sigma)^*\) such that \(x=uAv\) and \(y=uwv\text{.}\) The fact that \(x=uAv\) is just a way of saying that \(A\) occurs somewhere in \(x\text{.}\) When the production rule \(A\PRODUCES w\) is applied to substitute \(w\) for \(A\) in \(uAv\text{,}\) the result is \(uwv\text{,}\) which is \(y\text{.}\) Note that either \(u\) or \(v\) or both can be the empty string.
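The one-step relation \(\YIELDS\) is easy to compute. As an illustration, here is a small Python sketch (my own) that, for the grammar listed above, finds every string \(y\) with \(x\YIELDS y\) by substituting the right-hand side of some rule for one occurrence of its left-hand side:

# Production rules of the grammar above; "" is the empty string.
rules = {"T": ["TT", "A"], "A": ["aAa", "bB"], "B": ["bB", ""]}

def one_step(x):
    """All strings y such that x => y: substitute the right-hand side
    w of some rule A -> w for one occurrence of A in x."""
    ys = set()
    for i, c in enumerate(x):
        if c in rules:                       # c is a non-terminal
            for w in rules[c]:
                ys.add(x[:i] + w + x[i + 1:])
    return ys

print(one_step("aTB"))   # {'aTTB', 'aAB', 'aTbB', 'aT'}, in some order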
If a string \(y\) can be obtained from a string \(x\) by applying a sequence of zero or more production rules, we write \(x\YIELDS_G^* y\text{.}\) In most cases, the “\(G\)” in the notations \(\YIELDS_G\) and \(\YIELDS_G^*\) will be omitted, assuming that the grammar in question is understood. Note that \(\YIELDS\) is a relation on the set \((V\cup\Sigma)^*\text{.}\) The relation \(\YIELDSTAR\) is the reflexive, transitive closure of that relation. (This explains the use of “\(*\)”, which is usually used to denote the transitive, but not necessarily reflexive, closure of a relation. In this case, \(\YIELDSTAR\) is reflexive as well as transitive since \(x\;\YIELDSTAR x\) is true for any string \(x\text{.}\)) For example, using the grammar that is defined by the above list of production rules, we have
\begin{align*}
aTB \amp\YIELDS aTbB\\
\amp\YIELDS aTbbB
\end{align*}
From this, it follows that \(aTB\;\YIELDSTAR aTbbB\text{.}\) The relation \(\YIELDS\) is read “yields” or “produces” while \(\YIELDSTAR\) can be read “yields in zero or more steps” or “produces in zero or more steps.” The following theorem states some simple facts about the relations \(\YIELDS\) and \(\YIELDSTAR\text{:}\)
Theorem 17.5.2.
Let \(G\) be the context-free grammar \((V,\Sigma,P,S)\text{.}\) Then:
1. If \(x\) and \(y\) are strings in \((V\cup\Sigma)^*\) such that \(x\YIELDS y\text{,}\) then \(x\;\YIELDSTAR y\text{.}\)
2. If \(x\text{,}\) \(y\text{,}\) and \(z\) are strings in \((V\cup\Sigma)^*\) such that \(x\;\YIELDSTAR y\) and \(y\;\YIELDSTAR z\text{,}\) then \(x\;\YIELDSTAR z\text{.}\)
3. If \(x\) and \(y\) are strings in \((V\cup\Sigma)^*\) such that \(x\YIELDS y\text{,}\) and if \(s\) and \(t\) are any strings in \((V\cup\Sigma)^*\text{,}\) then \(sxt\YIELDS syt\text{.}\)
4. If \(x\) and \(y\) are strings in \((V\cup\Sigma)^*\) such that \(x\;\YIELDSTAR y\text{,}\) and if \(s\) and \(t\) are any strings in \((V\cup\Sigma)^*\text{,}\) then \(sxt\;\YIELDSTAR syt\text{.}\)
Proof.
Parts 1 and 2 follow from the fact that \(\YIELDSTAR\) is the transitive closure of \(\YIELDS\text{.}\) Part 4 follows easily from Part 3. (I leave this as an exercise.) To prove Part 3, suppose that \(x\text{,}\) \(y\text{,}\) \(s\text{,}\) and \(t\) are strings such that \(x\YIELDS y\text{.}\) By definition, this means that there exist strings \(u\) and \(v\) and a production rule \(A\PRODUCES w\) such that \(x=uAv\) and \(y=uwv\text{.}\) But then we also have \(sxt=suAvt\) and \(syt=suwvt\text{.}\) These two equations, along with the existence of the production rule \(A\PRODUCES w\text{,}\) show, by definition, that \(sxt\YIELDS syt\text{.}\)
We can use \(\YIELDSTAR\) to give a formal definition of the language generated by a context-free grammar:
Definition 17.5.3. Context-free Language.
Suppose that \(G=(V,\Sigma,P,S)\) is a context-free grammar. Then the language generated by \(G\) is the language \(L(G)\) over the alphabet \(\Sigma\) defined by
\begin{gather*}
L(G)=\{w\in\Sigma^*\;|\; S\;\YIELDSTAR w\}
\end{gather*}
That is, \(L(G)\) contains any string of terminal symbols that can be obtained by starting with the string consisting of the start symbol, \(S\text{,}\) and applying a sequence of production rules.
A language \(L\) is said to be a context-free language if there is a context-free grammar \(G\) such that \(L(G)\) is \(L\text{.}\) Note that there might be many different context-free grammars that generate the same context-free language. Two context-free grammars that generate the same language are said to be equivalent.
Suppose \(G\) is a context-free grammar with start symbol \(S\) and suppose \(w\in L(G)\text{.}\) By definition, this means that there is a sequence of one or more applications of production rules which produces the string \(w\) from \(S\text{.}\) This sequence has the form \(S\YIELDS x_1\YIELDS x_2\YIELDS\cdots\YIELDS w\text{.}\) Such a sequence is called a derivation of \(w\) (in the grammar \(G\)). Note that \(w\) might have more than one derivation. That is, it might be possible to produce \(w\) in several different ways.
Consider the language \(L=\{a^nb^n| n\in\N\}\text{.}\) We already know that \(L\) is not a regular language. However, it is a context-free language. That is, there is a context-free grammar \(G\) such that \(L\) is the language generated by \(G\text{.}\) This gives us our first theorem about grammars:
Theorem 17.5.4.
Let \(L\) be the language \(L=\{a^nb^n| n\in\N\}\text{.}\) Let \(G\) be the context-free grammar \((V,\Sigma,P,S)\) where \(V=\{S\}\text{,}\) \(\Sigma=\{a,b\}\text{,}\) and \(P\) consists of the productions
\begin{align*}
S \amp \PRODUCES aSb \\
S \amp \PRODUCES \EMPTYSTRING
\end{align*}
Then \(L=L(G)\text{,}\) so that \(L\) is a context-free language. In particular, there exist context-free languages which are not regular.
Proof.
To show that \(L=L(G)\text{,}\) we must show both that \(L\SUB L(G)\) and that \(L(G)\SUB L\text{.}\) To show that \(L\SUB L(G)\text{,}\) let \(w\) be an arbitrary element of \(L\text{.}\) By definition of \(L\text{,}\) \(w=a^nb^n\) for some \(n\in\N\text{.}\) We show that \(w\in L(G)\) by induction on \(n\text{.}\) In the case where \(n=0\text{,}\) we have \(w=\EMPTYSTRING\text{.}\) Now, \(\EMPTYSTRING\in L(G)\) since \(\EMPTYSTRING\) can be produced from the start symbol \(S\) by an application of the rule \(S\PRODUCES\EMPTYSTRING\text{,}\) so our claim is true for \(n=0\text{.}\) Now, suppose that \(k\in\N\) and that we already know that \(a^kb^k\in L(G)\text{.}\) We must show that \(a^{k+1}b^{k+1}\in L(G)\text{.}\) Since \(S\;\YIELDSTAR a^kb^k\text{,}\) we also have, by Theorem 17.5.2, that \(aSb\;\YIELDSTAR aa^kb^kb\text{.}\) That is, \(aSb\;\YIELDSTAR a^{k+1}b^{k+1}\text{.}\) Combining this with the production rule \(S\PRODUCES aSb\text{,}\) we see that \(S\;\YIELDSTAR a^{k+1}b^{k+1}\text{.}\) This means that \(a^{k+1}b^{k+1}\in L(G)\text{,}\) as we wanted to show. This completes the proof that \(L\SUB L(G)\text{.}\)
To show that \(L(G)\SUB L\text{,}\) suppose that \(w\in L(G)\text{.}\) That is, \(S\;\YIELDSTAR w\text{.}\) We must show that \(w=a^nb^n\) for some \(n\text{.}\) Since \(S\;\YIELDSTAR w\text{,}\) there is a derivation \(S\YIELDS x_0\YIELDS x_1\YIELDS\cdots\YIELDS x_n\text{,}\) where \(w=x_n\text{.}\) We first prove by induction on \(n\) that in any derivation \(S\YIELDS x_0\YIELDS x_1\YIELDS\cdots\YIELDS x_n\text{,}\) we must have either \(x_n=a^nb^n\) or \(x_n=a^{n+1}Sb^{n+1}\text{.}\) Consider the case \(n=0\text{.}\) Suppose \(S\YIELDS x_0\text{.}\) Then we must have that \(S\PRODUCES x_0\) is a rule in the grammar, so \(x_0\) must be either \(\EMPTYSTRING\) or \(aSb\text{.}\) Since \(\EMPTYSTRING=a^0b^0\) and \(aSb=a^{0+1}Sb^{0+1}\text{,}\) \(x_0\) is of the required form. Next, consider the inductive case. Suppose that \(k\ge 0\) and we already know that in any derivation \(S\YIELDS x_0\YIELDS x_1\YIELDS\cdots\YIELDS x_k\text{,}\) we must have \(x_k=a^kb^k\) or \(x_k=a^{k+1}Sb^{k+1}\text{.}\) Suppose that \(S\YIELDS x_0\YIELDS x_1\YIELDS\cdots\YIELDS x_k\YIELDS x_{k+1}\text{.}\) We know by induction that \(x_k=a^kb^k\) or \(x_k=a^{k+1}Sb^{k+1}\text{,}\) but since \(x_k\YIELDS x_{k+1}\) and \(a^kb^k\) contains no non-terminal symbols, we must have \(x_k=a^{k+1}Sb^{k+1}\text{.}\) Since \(x_{k+1}\) is obtained by applying one of the production rules \(S\PRODUCES\EMPTYSTRING\) or \(S\PRODUCES aSb\) to \(x_k\text{,}\) \(x_{k+1}\) is either \(a^{k+1}\EMPTYSTRING b^{k+1}\) or \(a^{k+1}aSbb^{k+1}\text{.}\) That is, \(x_{k+1}\) is either \(a^{k+1}b^{k+1}\) or \(a^{k+2}Sb^{k+2}\text{,}\) as we wanted to show. This completes the induction. Turning back to \(w\text{,}\) we see that \(w\) must be of the form \(a^nb^n\) or of the form \(a^{n+1}Sb^{n+1}\text{.}\) But since \(w\in L(G)\text{,}\) it can contain no non-terminal symbols, so \(w\) must be of the form \(a^nb^n\text{,}\) as we wanted to show. This completes the proof that \(L(G)\SUB L\text{.}\)
I have given a very formal and detailed proof of this theorem, to show how it can be done and to show how induction plays a role in many proofs about grammars. However, a more informal proof of the theorem would probably be acceptable and might even be more convincing. To show that \(L\SUB L(G)\text{,}\) we could just note that the derivation \(S\YIELDS aSb\YIELDS a^2Sb^2\YIELDS\cdots\YIELDS a^nSb^n\YIELDS a^nb^n\) demonstrates that \(a^nb^n\in L(G)\text{.}\) On the other hand, it is clear that every derivation for this grammar must be of this form, so every string in \(L(G)\) is of the form \(a^nb^n\text{.}\)
Example 17.5.5.
For another example, consider the language \(\{a^nb^m| n\ge m\ge0\}\text{.}\) Let’s try to design a grammar that generates this language. This is similar to the previous example, but now we want to include strings that contain more \(a\text{'s}\) than \(b\text{'s}\text{.}\) The production rule \(S\PRODUCES aSb\) always produces the same number of \(a\text{'s}\) and \(b\text{'s}\text{.}\) Can we modify this idea to produce more \(a\text{'s}\) than \(b\text{'s}\text{?}\)
One approach would be to produce a string containing just as many \(a\text{'s}\) as \(b\text{'s}\text{,}\) and then to add some extra \(a\text{'s}\text{.}\) A rule that can generate any number of \(a\text{'s}\) is \(A\PRODUCES aA\text{.}\) After applying the rule \(S\PRODUCES aSb\) for a while, we want to move to a new state in which we apply the rule \(A\PRODUCES aA\text{.}\) We can get to the new state by applying a rule \(S\PRODUCES A\) that changes the \(S\) into an \(A\text{.}\) We still need a way to finish the process, which means getting rid of all non-terminal symbols in the string. For this, we can use the rule \(A\PRODUCES\EMPTYSTRING\text{.}\) Putting these rules together, we get the grammar
\begin{align*}
S \amp \PRODUCES aSb\\
S \amp \PRODUCES A \\
A \amp \PRODUCES aA \\
A \amp \PRODUCES \EMPTYSTRING
\end{align*}
This grammar does indeed generate the language \(\{a^nb^m| n\ge m\ge 0\}\text{.}\) With slight variations on this grammar, we can produce other related languages. For example, if we replace the rule \(A\PRODUCES\EMPTYSTRING\) with \(A\PRODUCES a\text{,}\) we get the language \(\{a^nb^m| n \gt m\ge 0\}\text{.}\)
There are other ways to generate the language \(\{a^nb^m| n\ge m\ge 0\}\text{.}\) For example, the extra non-terminal symbol, \(A\text{,}\) is not really necessary, if we allow \(S\) to sometimes produce a single \(a\) without a \(b\text{.}\) This leads to the grammar
\begin{align*}
S \amp \PRODUCES aSb \\
S \amp \PRODUCES aS \\
S \amp \PRODUCES \EMPTYSTRING
\end{align*}
(But note that the rule \(S\PRODUCES Sa\) would not work in place of \(S\PRODUCES aS\text{,}\) since it would allow the production of strings in which an \(a\) can follow a \(b\text{,}\) and there are no such strings in the language \(\{a^nb^m| n\ge m\ge 0\}\text{.}\)) And here are two more grammars that generate this language:
\begin{align*}
S\amp\PRODUCES AB \amp\qquad S\amp\PRODUCES ASb \\
A\amp\PRODUCES aA \amp\qquad S\amp\PRODUCES A \\
B\amp\PRODUCES aBb \amp\qquad A\amp\PRODUCES aA \\
A\amp\PRODUCES \EMPTYSTRING \amp\qquad A\amp\PRODUCES a \\
B\amp\PRODUCES \EMPTYSTRING \amp\qquad S\amp\PRODUCES\EMPTYSTRING
\end{align*}
Example 17.5.6.
Consider another variation on the language \(\{a^nb^n| n\in\N\}\text{,}\) in which the \(a\text{'s}\) and \(b\text{'s}\) can occur in any order, but the number of \(a\text{'s}\) is still equal to the number of \(b\text{'s}\text{.}\) This language can be defined as \(L=\{w\in\{a,b\}^*| n_a(w) = n_b(w)\}\text{.}\) This language includes strings such as \(abbaab\text{,}\)\(baab\text{,}\) and \(bbbaaa\text{.}\)
Let’s start with the grammar containing the rules \(S\PRODUCES aSb\) and \(S\PRODUCES\EMPTYSTRING\text{.}\) We can try adding the rule \(S\PRODUCES bSa\text{.}\) Every string that can be generated using these three rules is in the language \(L\text{.}\) However, not every string in \(L\) can be generated. A derivation that starts with \(S\YIELDS aSb\) can only produce strings that begin with \(a\) and end with \(b\text{.}\) A derivation that starts with \(S\YIELDS bSa\) can only generate strings that begin with \(b\) and end with \(a\text{.}\) There is no way to generate the strings \(baab\) or \(abbbabaaba\text{,}\) which are in the language \(L\text{.}\) But we shall see that any string in \(L\) that begins and ends with the same letter can be written in the form \(xy\) where \(x\) and \(y\) are shorter strings in \(L\text{.}\) To produce strings of this form, we need one more rule, \(S\PRODUCES SS\text{.}\) The complete set of production rules for the language \(L\) is
\begin{align*}
S \amp \PRODUCES aSb\\
S \amp \PRODUCES bSa\\
S \amp \PRODUCES SS\\
S \amp \PRODUCES \EMPTYSTRING
\end{align*}
It’s easy to see that every string that can be generated using these rules is in \(L\text{,}\) since each rule introduces the same number of \(a\text{'s}\) as \(b\text{'s}\text{.}\) But we also need to check that every string \(w\) in \(L\) can be generated by these rules. This can be done by induction on the length of \(w\text{,}\) using the second form of the principle of mathematical induction. In the base case, \(|w|=0\) and \(w=\EMPTYSTRING\text{.}\) In this case, \(w\in L(G)\) since \(S\YIELDS\EMPTYSTRING\) in one step. Suppose \(|w|=k\text{,}\) where \(k\gt 0\text{,}\) and suppose that we already know that for any \(x\in L\) with \(|x|\lt k\text{,}\) \(S\;\YIELDSTAR x\text{.}\) To finish the induction we must show, based on this induction hypothesis, that \(S\;\YIELDSTAR w\text{.}\)
Suppose that the first and last characters of \(w\) are different. Then \(w\) is either of the form \(axb\) or of the form \(bxa\text{,}\) for some string \(x\text{.}\) Let’s assume that \(w\) is of the form \(axb\text{.}\) (The case where \(w\) is of the form \(bxa\) is handled in a similar way.) Since \(w\) has the same number of \(a\text{'s}\) and \(b\text{'s}\) and since \(x\) has one fewer \(a\) than \(w\) and one fewer \(b\) than \(w\text{,}\) \(x\) must also have the same number of \(a\text{'s}\) as \(b\text{'s}\text{.}\) That is, \(x\in L\text{.}\) But \(|x|=|w|-2 \lt k\text{,}\) so by the induction hypothesis, \(x\in L(G)\text{.}\) So we have \(S\;\YIELDSTAR x\text{.}\) By Theorem 17.5.2, we then get \(aSb\;\YIELDSTAR axb\text{.}\) Combining this with the fact that \(S\YIELDS aSb\text{,}\) we get that \(S\;\YIELDSTAR axb\text{,}\) that is, \(S\;\YIELDSTAR w\text{.}\) This proves that \(w\in L(G)\text{.}\)
Finally, suppose that the first and last characters of \(w\) are the same. Let’s say that \(w\) begins and ends with \(a\text{.}\) (The case where \(w\) begins and ends with \(b\) is handled in a similar way.) I claim that \(w\) can be written in the form \(xy\) where \(x\in L(G)\) and \(y\in L(G)\) and neither \(x\) nor \(y\) is the empty string. This will finish the induction, since we will then have by the induction hypothesis that \(S\;\YIELDSTAR x\) and \(S\;\YIELDSTAR y\text{,}\) and we can derive \(xy\) from \(S\) by first applying the rule \(S\PRODUCES SS\) and then using the first \(S\) on the right-hand side to derive \(x\) and the second to derive \(y\text{.}\)
It only remains to figure out how to divide \(w\) into two strings \(x\) and \(y\) which are both in \(L(G)\text{.}\) The technique that is used is one that is more generally useful. Suppose that \(w=c_1c_2\cdots c_k\text{,}\) where each \(c_i\) is either \(a\) or \(b\text{.}\) Consider the sequence of integers \(r_1\text{,}\) \(r_2\text{,}\) \(\dots\text{,}\) \(r_k\) where for each \(i = 1, 2, \dots, k\text{,}\) \(r_i\) is the number of \(a\text{'s}\) in \(c_1c_2\cdots c_i\) minus the number of \(b\text{'s}\) in \(c_1c_2\cdots c_i\text{.}\) Since \(c_1=a\text{,}\) \(r_1=1\text{.}\) Since \(w\in L\text{,}\) \(r_k=0\text{.}\) And since \(c_k=a\text{,}\) we must have \(r_{k-1}=r_k-1 = -1\text{.}\) Furthermore the difference between \(r_{i+1}\) and \(r_i\) is either \(1\) or \(-1\text{,}\) for \(i=1,2,\dots,k-1\text{.}\)
Since \(r_1=1\) and \(r_{k-1}=-1\) and the value of \(r_i\) goes up or down by 1 when \(i\) increases by 1, \(r_i\) must be zero for some \(i\) between 1 and \(k-1\text{.}\) That is, \(r_i\) cannot get from 1 to \(-1\) unless it passes through zero. Let \(i\) be a number between 1 and \(k-1\) such that \(r_i=0\text{.}\) Let \(x=c_1c_2\cdots c_i\) and let \(y=c_{i+1}c_{i+2}\cdots c_k\text{.}\) Note that \(xy=w\text{.}\) The fact that \(r_i=0\) means that the string \(c_1c_2\cdots c_i\) has the same number of \(a\text{'s}\) and \(b\text{'s}\text{,}\) so \(x\in L\text{;}\) and since \(|x|\lt k\text{,}\) the induction hypothesis then gives \(x\in L(G)\text{.}\) Since \(w\) and \(x\) each have equal numbers of \(a\text{'s}\) and \(b\text{'s}\text{,}\) so does \(y\text{,}\) and \(|y|\lt k\text{,}\) so \(y\in L(G)\) also. Since \(i\) is strictly between 1 and \(k-1\text{,}\) neither \(x\) nor \(y\) is the empty string. This is all that we needed to show to finish the proof that \(L=L(G)\text{.}\)
The basic idea of this proof is that if \(w\) contains the same number of \(a\text{'s}\) as \(b\text{'s}\text{,}\) then an \(a\) at the beginning of \(w\) must have a “matching” \(b\) somewhere in \(w\text{.}\) This \(b\) matches the \(a\) in the sense that the corresponding \(r_i\) is zero, and the \(b\) marks the end of a string \(x\) which contains the same number of \(a\text{'s}\) as \(b\text{'s}\text{.}\) For example, in the string \(aababbabba\text{,}\) the \(a\) at the beginning of the string is matched by the third \(b\text{,}\) since \(aababb\) is the shortest prefix of \(aababbabba\) that has an equal number of \(a\text{'s}\) and \(b\text{'s}\text{.}\)
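This matching argument translates directly into a procedure for finding the split point. Here is a Python sketch of it (my own): it scans \(w\text{,}\) maintains the running count \(r_i\text{,}\) and cuts the string at the first proper prefix where the count returns to zero.

def split_match(w):
    """For a string w with equal numbers of a's and b's, return (x, y)
    with w = x + y, where x is the shortest nonempty prefix of w that
    also has equal numbers of a's and b's."""
    r = 0
    for i, c in enumerate(w, start=1):
        r += 1 if c == "a" else -1    # the running count r_i
        if r == 0 and i < len(w):
            return w[:i], w[i:]
    # No proper split exists; this happens exactly when the first and
    # last letters of w are different.
    return w, ""

print(split_match("aababbabba"))   # ('aababb', 'abba'), as in the text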
Example 17.5.7. Balanced Parentheses.
Closely related to this idea of matching \(a\text{'s}\) and \(b\text{'s}\) is the idea of balanced parentheses. Consider a string made up of parentheses, such as (()(()))(()). The parentheses in this sample string are balanced because each left parenthesis has a matching right parenthesis, and the matching pairs are properly nested. A careful definition uses the sort of integer sequence introduced in the above proof. Let \(w\) be a string of parentheses. Write \(w=c_1c_2\cdots c_n\text{,}\) where each \(c_i\) is either ( or ). Define a sequence of integers \(r_1\text{,}\) \(r_2\text{,}\) \(\dots\text{,}\) \(r_n\text{,}\) where \(r_i\) is the number of left parentheses in \(c_1c_2\cdots c_i\) minus the number of right parentheses. We say that the parentheses in \(w\) are balanced if \(r_n=0\) and \(r_i\ge0\) for all \(i=1,2,\dots,n\text{.}\) The fact that \(r_n=0\) says that \(w\) contains the same number of left parentheses as right parentheses. The fact that \(r_i\ge0\) means that the nesting of pairs of parentheses is correct: You can’t have a right parenthesis unless it is balanced by a left parenthesis in the preceding part of the string. The language that consists of all balanced strings of parentheses is context-free. It is generated by the grammar
\begin{align*}
S\amp\PRODUCES (\,S\,)\\
S\amp\PRODUCES SS \\
S\amp\PRODUCES \EMPTYSTRING
\end{align*}
The proof is similar to the preceding proof about strings of \(a\text{'s}\) and \(b\text{'s}\text{.}\) (It might seem that I’ve made an awfully big fuss about matching and balancing. The reason is that this is one of the few things that we can do with context-free languages that we can’t do with regular languages.)
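Checking the definition of balance is a single pass over the string. A minimal Python sketch:

def is_balanced(w):
    """Balanced iff the running count r_i never goes negative and
    ends at zero."""
    r = 0
    for c in w:
        r += 1 if c == "(" else -1
        if r < 0:        # a right parenthesis with nothing to match it
            return False
    return r == 0

assert is_balanced("(()(()))(())")
assert not is_balanced(")(") and not is_balanced("(()")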
Before leaving this section, we should look at a few more general results. Since we know that most operations on regular languages produce languages that are also regular, we can ask whether a similar result holds for context-free languages. We will see later that the intersection of two context-free languages is not necessarily context-free. Also, the complement of a context-free language is not necessarily context-free. However, some other operations on context-free languages do produce context-free languages.
Theorem 17.5.8.
Suppose that \(L\) and \(M\) are context-free languages. Then the languages \(L\cup M\text{,}\) \(LM\text{,}\) and \(L^*\) are also context-free.
Proof.
I will prove only the first claim of the theorem, that \(L\cup M\) is context-free. In the exercises for this section, you are asked to construct grammars for \(LM\) and \(L^*\) (without giving formal proofs that your answers are correct).
Let \(G=(V,\Sigma,P,S)\) and \(H=(W,\Gamma,Q,T)\) be context-free grammars such that \(L=L(G)\) and \(M=L(H)\text{.}\) We can assume that \(W\cap V=\emptyset\text{,}\) since otherwise we could simply rename the non-terminal symbols in \(W\text{.}\) The idea of the proof is that to generate a string in \(L\cup M\text{,}\) we first decide whether we want a string in \(L\) or a string in \(M\text{.}\) Once that decision is made, to make a string in \(L\text{,}\) we use production rules from \(G\text{,}\) while to make a string in \(M\text{,}\) we use rules from \(H\text{.}\) We have to design a grammar, \(K\text{,}\) to represent this process.
Let \(R\) be a symbol that is not in any of the alphabets \(V\text{,}\)\(W\text{,}\)\(\Sigma\text{,}\) or \(\Gamma\text{.}\)\(R\) will be the start symbol of \(K\text{.}\) The production rules for \(K\) consist of all the production rules from \(G\) and \(H\) together with two new rules:
\begin{gather*}
R\PRODUCES S\\
R\PRODUCES T
\end{gather*}
Suppose that \(w\in L\text{.}\) That is, \(w\in L(G)\text{,}\) so there is a derivation \(S\YIELDS_G^*w\text{.}\) Since every rule from \(G\) is also a rule in \(K\text{,}\) it follows that \(S\YIELDS_K^* w\text{.}\) Combining this with the fact that \(R\YIELDS_K S\text{,}\) we have that \(R\YIELDS_K^* w\text{,}\) and \(w\in L(K)\text{.}\) This shows that \(L\SUB L(K)\text{.}\) In an exactly similar way, we can show that \(M\SUB L(K)\text{.}\) Thus, \(L\cup M\SUB L(K)\text{.}\)
It remains to show that \(L(K)\SUB L\cup M\text{.}\) Suppose \(w\in L(K)\text{.}\) Then there is a derivation \(R\YIELDS_K^*w\text{.}\) This derivation must begin with an application of one of the rules \(R\PRODUCES S\) or \(R\PRODUCES T\text{,}\) since these are the only rules in which \(R\) appears. If the first rule applied in the derivation is \(R\PRODUCES S\text{,}\) then the remainder of the derivation shows that \(S\YIELDS_K^*w\text{.}\) Starting from \(S\text{,}\) the only rules that can be applied are rules from \(G\text{,}\) so in fact we have \(S\YIELDS_G^*w\text{.}\) This shows that \(w\in L\text{.}\) Similarly, if the first rule applied in the derivation \(R\YIELDS_K^*w\) is \(R\PRODUCES T\text{,}\) then \(w\in M\text{.}\) In any case, \(w\in L\cup M\text{.}\) This proves that \(L(K)\SUB L\cup M\text{.}\)
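The construction in this proof is completely mechanical, as the following Python sketch shows. It assumes that each grammar is given as a pair consisting of a rule map and a start symbol, with disjoint sets of non-terminals, as in the proof; the function name union_grammar is my own.

def union_grammar(g, h, new_start="R"):
    """Build the grammar K for L(G) ∪ L(H) from grammars G and H.
    Assumes the two sets of non-terminals are disjoint (rename first
    if necessary) and that new_start occurs in neither grammar."""
    (g_rules, g_start), (h_rules, h_start) = g, h
    k_rules = {new_start: [g_start, h_start]}   # R -> S and R -> T
    k_rules.update(g_rules)
    k_rules.update(h_rules)
    return k_rules, new_start

# Example: {a^n b^n} union {balanced strings of parentheses}
G = ({"S": ["aSb", ""]}, "S")
H = ({"T": ["(T)", "TT", ""]}, "T")
print(union_grammar(G, H))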
Finally, we should clarify the relationship between context-free languages and regular languages. We have already seen that there are context-free languages which are not regular. On the other hand, it turns out that every regular language is context-free. That is, given any regular language, there is a context-free grammar that generates that language. This means that any syntax that can be expressed by a regular expression, by a DFA, or by an NFA could also be expressed by a context-free grammar. In fact, we only need a certain restricted type of context-free grammar to duplicate the power of regular expressions.
Definition 17.5.9. Right-Regular Grammar.
A right-regular grammar is a context-free grammar in which the right-hand side of every production rule has one of the following forms: the empty string; a string consisting of a single non-terminal symbol; or a string consisting of a single terminal symbol followed by a single non-terminal symbol.
Examples of the types of production rule that are allowed in a right-regular grammar are \(A\PRODUCES\EMPTYSTRING\text{,}\) \(B\PRODUCES C\text{,}\) and \(D\PRODUCES aE\text{.}\) The idea of the proof of the theorem below is that given a right-regular grammar, we can build a corresponding NFA, and vice versa. The states of the NFA correspond to the non-terminal symbols of the grammar. The start symbol of the grammar corresponds to the starting state of the NFA. A production rule of the form \(A\PRODUCES bC\) corresponds to a transition in the NFA from state \(A\) to state \(C\) while reading the symbol \(b\text{.}\) A production rule of the form \(A\PRODUCES B\) corresponds to an \(\EMPTYSTRING\)-transition from state \(A\) to state \(B\) in the NFA. And a production rule of the form \(A\PRODUCES\EMPTYSTRING\) exists in the grammar if and only if \(A\) is a final state in the NFA. With this correspondence, a derivation of a string \(w\) in the grammar corresponds to an execution path through the NFA as it accepts the string \(w\text{.}\) I won’t give a complete proof here. You are welcome to work through the details if you want. But the important fact is:
Theorem 17.5.10.
A language \(L\) is regular if and only if there is a right-regular grammar \(G\) such that \(L=L(G)\text{.}\) In particular, every regular language is context-free.
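The correspondence described above is also mechanical. Here is a Python sketch (my own) that turns a right-regular grammar into an NFA transition table, following exactly the three cases just listed. The example uses \(S\PRODUCES aS\text{,}\) \(S\PRODUCES bS\text{,}\) \(S\PRODUCES bF\text{,}\) and \(F\PRODUCES\EMPTYSTRING\text{,}\) one right-regular grammar for the language \((a\REOR b)^*b\text{.}\)

def grammar_to_nfa(rules, start):
    """Right-regular grammar -> NFA.  States are the non-terminals;
    A -> bC gives a transition from A to C on b, A -> B gives an
    epsilon-transition (symbol None), and A -> "" makes A a final
    (accepting) state."""
    delta = {}            # (state, symbol) -> set of next states
    accepting = set()
    for A, rhss in rules.items():
        for w in rhss:
            if w == "":                         # A -> empty string
                accepting.add(A)
            elif len(w) == 1 and w.isupper():   # A -> B
                delta.setdefault((A, None), set()).add(w)
            else:                               # A -> bC
                delta.setdefault((A, w[0]), set()).add(w[1])
    return delta, start, accepting

# A right-regular grammar for (a|b)*b:  S -> aS | bS | bF,  F -> ε
print(grammar_to_nfa({"S": ["aS", "bS", "bF"], "F": [""]}, "S"))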
Subsection 17.5.2 Backus-Naur Form
Context-free grammars are used to describe some aspects of the syntax of programming languages. However, the notation that is used for grammars in the context of programming languages is somewhat different from the notation introduced in the preceding section. The notation that is used is called Backus-Naur Form or BNF. It is named after computer scientists John Backus and Peter Naur, who developed the notation. Actually, several variations of BNF exist. I will discuss one of them here. BNF can be used to describe the syntax of natural languages, as well as programming languages, and some of the examples in this section will deal with the syntax of English.
Like context-free grammars, BNF grammars make use of production rules, non-terminals, and terminals. The non-terminals are usually given meaningful, multi-character names. Here, I will follow a common practice of enclosing non-terminals in angle brackets, so that they can be easily distinguished. For example, \(\langle noun \rangle\) and \(\langle sentence \rangle\) could be non-terminals in a BNF grammar for English, while \(\langle program \rangle\) and \(\langle if{\text -}statement \rangle\) might be used in a BNF grammar for a programming language. Note that a BNF non-terminal usually represents a meaningful syntactic category, that is, a certain type of building block in the syntax of the language that is being described, such as an adverb, a prepositional phrase, or a variable declaration statement. The terminals of a BNF grammar are the things that actually appear in the language that is being described. In the case of natural language, the terminals are individual words.
In BNF production rules, I will use the symbol “\(::=\)” in place of the “\(\PRODUCES\)” that is used in context-free grammars. BNF production rules are more powerful than the production rules in context-free grammars. That is, one BNF rule might be equivalent to several context-free grammar rules. As for context-free grammars, the left-hand side of a BNF production rule is a single non-terminal symbol. The right hand side can include terminals and non-terminals, and can also use the following notations, which should remind you of notations used in regular expressions:
A vertical bar, \(\BNFALT\text{,}\) indicates a choice of alternatives. For example, the rule
\begin{gather*}
\langle declaration \rangle ::= \langle type \rangle\ \langle variable \rangle\ \text{;}\ \BNFALT\ \langle type \rangle\ \langle variable \rangle\ \text{=}\ \langle expression \rangle\ \text{;}
\end{gather*}
says that \(\langle declaration \rangle\) can be replaced either by “\(\langle type \rangle\) \(\langle variable \rangle\text{;}\)” or by “\(\langle type \rangle\) \(\langle variable \rangle\) = \(\langle expression \rangle\text{;}\)”. (The symbols “=” and “;” are terminal symbols in this rule.)
Items enclosed between “[” and “]” are optional: the enclosed items can either be included or left out.
Items enclosed between “[” and “]\(\dots\)” can be repeated zero or more times. (This has the same effect as a “\(*\)” in a regular expression.) For example, the rule
\begin{gather*}
\langle integer \rangle ::= \langle digit \rangle\ [\ \langle digit \rangle\ ]\dots
\end{gather*}
says that an \(\langle integer \rangle\) consists of a \(\langle digit \rangle\) followed optionally by any number of additional \(\langle digit \rangle\)’s. That is, the non-terminal \(\langle integer \rangle\) can be replaced by \(\langle digit \rangle\) or by \(\langle digit \rangle\) \(\langle digit \rangle\) or by \(\langle digit \rangle\) \(\langle digit \rangle\) \(\langle digit \rangle\text{,}\) and so on.
Parentheses can be used as usual, for grouping.
All these notations can be expressed in a context-free grammar by introducing additional production rules. For example, the BNF rule “\(\langle sign \rangle ::= +\ \BNFALT\ -\)” is equivalent to the two rules, “\(\langle sign \rangle ::= +\)” and “\(\langle sign \rangle ::= -\)”. A rule that contains an optional item can also be replaced by two rules, one that includes the item and one that omits it. For example, a rule of the form “\(\langle x \rangle ::= a\ [\ b\ ]\ c\)” is equivalent to the two rules “\(\langle x \rangle ::= a\ b\ c\)” and “\(\langle x \rangle ::= a\ c\)”.
In context-free grammars, repetition can be expressed by using a recursive rule such as “\(S\PRODUCES aS\)”, in which the same non-terminal symbol appears both on the left-hand side and on the right-hand side of the rule. BNF-style notation using “[” and “]\(\dots\)” can be eliminated by replacing it with a new non-terminal symbol and adding a recursive rule to allow that symbol to repeat zero or more times. For example, the production rule
\begin{gather*}
\langle integer \rangle ::= \langle digit \rangle\ [\ \langle digit \rangle\ ]\dots
\end{gather*}
can be replaced by three rules using a new non-terminal symbol \(\langle digit{\text -}list \rangle\) to represent a string of zero or more \(\langle digit \rangle\)’s:
\begin{align*}
\langle integer \rangle \amp ::= \langle digit \rangle\ \langle digit{\text -}list \rangle\\
\langle digit{\text -}list \rangle \amp ::= \langle digit \rangle\ \langle digit{\text -}list \rangle\\
\langle digit{\text -}list \rangle \amp ::= \EMPTYSTRING
\end{align*}
As an example of a complete BNF grammar, let’s look at a BNF grammar for a very small subset of English. The start symbol for the grammar is \(\langle sentence \rangle\text{,}\) and the terminal symbols are English words. All the sentences that can be produced from this grammar are syntactically correct English sentences, although you wouldn’t encounter many of them in conversation. Here is the grammar:
This grammar can generate sentences such as “A dog chases the cat and the cat hides” and “The man loves a woman who runs.” The second sentence, for example, is generated by the derivation
BNF is most often used to specify the syntax of programming languages. Most programming languages are not, in fact, context-free languages, and BNF is not capable of expressing all aspects of their syntax. For example, BNF cannot express the fact that a variable must be declared before it is used or the fact that the number of actual parameters in a subroutine call statement must match the number of formal parameters in the declaration of the subroutine. So BNF is used to express the context-free aspects of the syntax of a programming language, and other restrictions on the syntax---such as the rule about declaring a variable before it is used---are expressed using informal English descriptions.
When BNF is applied to programming languages, the terminal symbols are generally “tokens,” which are the minimal meaningful units in a program. For example, the pair of symbols \(\lt=\) constitutes a single token, as does a string such as "Hello World". Every number is represented by a single token. (The actual value of the number is stored as a so-called “attribute” of the token, but the value plays no role in the context-free syntax of the language.) I will use the symbol \(\textbf{number}\) to represent a numerical token. Similarly, every variable name, subroutine name, or other identifier in the program is represented by the same token, which I will denote as \(\textbf{ident}\text{.}\) One final complication: Some symbols used in programs, such as “]” and “(”, are also used with a special meaning in BNF grammars. When such a symbol occurs as a terminal symbol, I will enclose it in double quotes. For example, in a BNF production rule such as
\begin{gather*}
\langle array{\text -}reference \rangle ::= \textbf{ident}\ \text{“[”}\ \langle expression \rangle\ \text{“]”}
\end{gather*}
the “\([\)” and “\(]\)” are terminal symbols in the language that is being described, rather than the BNF notation for an optional item. With this notation, here is part of a BNF grammar that describes statements in the Java programming language:
\begin{align*}
\langle statement \rangle \amp ::= \langle while{\text -}statement \rangle\ \BNFALT\ \langle if{\text -}statement \rangle\ \BNFALT\ \langle assignment{\text -}statement \rangle\ \BNFALT\ \dots\\
\langle while{\text -}statement \rangle \amp ::= \textbf{while}\ \text{“(”}\ \langle condition \rangle\ \text{“)”}\ \langle statement \rangle\\
\langle if{\text -}statement \rangle \amp ::= \textbf{if}\ \text{“(”}\ \langle condition \rangle\ \text{“)”}\ \langle statement \rangle\ [\ \textbf{else}\ \langle statement \rangle\ ]\\
\langle assignment{\text -}statement \rangle \amp ::= \langle variable \rangle\ \text{=}\ \langle expression \rangle\ \text{;}
\end{align*}
The non-terminals \(\langle condition \rangle\text{,}\) \(\langle variable \rangle\text{,}\) and \(\langle expression \rangle\) would, of course, have to be defined by other production rules in the grammar. Here is a set of rules that define simple expressions, made up of numbers, identifiers, parentheses and the arithmetic operators \(+\text{,}\) \(-\text{,}\) \(*\) and \(/\text{:}\)
\begin{align*}
\langle expression \rangle \amp ::= \langle term \rangle\ [\ (\ +\ \BNFALT\ -\ )\ \langle term \rangle\ ]\dots\\
\langle term \rangle \amp ::= \langle factor \rangle\ [\ (\ *\ \BNFALT\ /\ )\ \langle factor \rangle\ ]\dots\\
\langle factor \rangle \amp ::= \textbf{ident}\ \BNFALT\ \textbf{number}\ \BNFALT\ \text{“(”}\ \langle expression \rangle\ \text{“)”}
\end{align*}
The first rule says that an \(\langle expression \rangle\) is a sequence of one or more \(\langle term \rangle\)’s, separated by plus or minus signs. The second rule defines a \(\langle term \rangle\) to be a sequence of one or more \(\langle factor \rangle\)’s, separated by multiplication or division operators. The last rule says that a \(\langle factor \rangle\) can be either an identifier or a number or an \(\langle expression \rangle\) enclosed in parentheses. This small BNF grammar can generate expressions such as “\(3*5\)” and “\(x*(x+1)+3/(z+2*(3-x))+7\)”. The latter expression is made up of three terms: \(x*(x+1)\text{,}\) \(3/(z+2*(3-x))\text{,}\) and \(7\text{.}\) The first of these terms is made up of two factors, \(x\) and \((x+1)\text{.}\) The factor \((x+1)\) consists of the expression \(x+1\) inside a pair of parentheses.
The nice thing about this grammar is that the precedence rules for the operators are implicit in the grammar. For example, according to the grammar, the expression \(3+5*7\) is seen as \(\langle term \rangle\) + \(\langle term \rangle\text{,}\) where the first term is \(3\) and the second term is \(5*7\text{.}\) The \(5*7\) occurs as a group, which must be evaluated before the result is added to \(3\text{.}\) Parentheses can change the order of evaluation. For example, \((3+5)*7\) is generated by the grammar as a single \(\langle term \rangle\) of the form \(\langle factor \rangle\) * \(\langle factor \rangle\text{.}\) The first \(\langle factor \rangle\) is \((3+5)\text{.}\) When \((3+5)*7\) is evaluated, the value of \((3+5)\) is computed first and then multiplied by \(7\text{.}\) This is an example of how a grammar that describes the syntax of a language can also reflect its meaning.
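Because precedence is built into the rules, a program that follows the grammar evaluates expressions correctly with no extra machinery. The following Python sketch (my own) implements the three rules above as three mutually recursive functions, with \(\langle factor \rangle\) restricted to numbers and parenthesized expressions so that a value can actually be computed:

import re

def evaluate(s):
    """Evaluate s using the expression/term/factor grammar, with
    <factor> restricted to number | "(" <expression> ")"."""
    tokens = re.findall(r"\d+|[()+\-*/]", s)
    pos = 0
    def expression():   # <expression> ::= <term> [ ( + | - ) <term> ]...
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]; pos += 1
            value = value + term() if op == "+" else value - term()
        return value
    def term():         # <term> ::= <factor> [ ( * | / ) <factor> ]...
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]; pos += 1
            value = value * factor() if op == "*" else value / factor()
        return value
    def factor():       # <factor> ::= number | "(" <expression> ")"
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            value = expression()
            pos += 1    # consume the ")"
            return value
        return int(tok)
    result = expression()
    assert pos == len(tokens), "unconsumed input"
    return result

assert evaluate("3+5*7") == 38     # 5*7 groups as a term, then 3 is added
assert evaluate("(3+5)*7") == 56   # parentheses change the grouping

The structure of the code is exactly the structure of the grammar: term() is called from inside expression(), so multiplications are grouped before any addition takes place.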
Although this section has not introduced any really new ideas or theoretical results, I hope it has demonstrated how context-free grammars can be applied in practice.
Subsection 17.5.3 Parsing and Parse Trees
Suppose that \(G\) is a grammar for the language \(L\text{.}\) That is, \(L=L(G)\text{.}\) The grammar \(G\) can be used to generate strings in the language \(L\text{.}\) In practice, though, we often start with a string which might or might not be in \(L\text{,}\) and the problem is to determine whether the string is in the language and, if so, how it can be generated by \(G\text{.}\) The goal is to find a derivation of the string, using the production rules of the grammar, or to show that no such derivation exists. This is known as parsing the string. When the string is a computer program or a sentence in a natural language, parsing the string is an essential step in determining its meaning.
As an example that we will use throughout this section, consider the language that consists of arithmetic expressions containing parentheses, the binary operators \(+\) and \(*\text{,}\) and the variables \(x\text{,}\)\(y\text{,}\) and \(z\text{.}\) Strings in this language include \(x\text{,}\)\(x+y*z\text{,}\) and \(((x+y)*y)+z*z\text{.}\) Here is a context-free grammar that generates this language:
\begin{align*}
E \amp \PRODUCES E+E\\
E \amp \PRODUCES E*E\\
E \amp \PRODUCES (E)\\
E \amp \PRODUCES x\\
E \amp \PRODUCES y\\
E \amp \PRODUCES z
\end{align*}
Call the grammar described by these production rules \(G_1\text{.}\) The grammar \(G_1\) says that \(x\text{,}\)\(y\text{,}\) and \(z\) are expressions, and that you can make new expressions by adding two expressions, by multiplying two expressions, and by enclosing an expression in parentheses. (Later, we’ll look at other grammars for the same language---ones that turn out to have certain advantages over \(G_1\text{.}\))
Consider the string \(x+y*z\text{.}\) To show that this string is in the language \(L(G_1)\text{,}\) we can exhibit a derivation of the string from the start symbol \(E\text{.}\) For example:
\begin{align*}
E \amp \YIELDS E+E\\
\amp \YIELDS E+E*E\\
\amp \YIELDS E+E*z\\
\amp \YIELDS E+y*z\\
\amp \YIELDS x+y*z
\end{align*}
This derivation shows that the string \(x+y*z\) is in fact in \(L(G_1)\text{.}\) Now, this string has many other derivations. At each step in the derivation, there can be a lot of freedom about which rule in the grammar to apply next. Some of this freedom is clearly not very meaningful. When faced with the string \(E+E*E\) in the above example, the order in which we replace the \(E\text{'s}\) with the variables \(x\text{,}\) \(y\text{,}\) and \(z\) doesn’t much matter. To cut out some of this meaningless freedom, we could agree that in each step of a derivation, the non-terminal symbol that is replaced is the leftmost non-terminal symbol in the string. A derivation in which this is true is called a left derivation. The following left derivation of the string \(x+y*z\) uses the same production rules as the previous derivation, but it applies them in a different order:
\begin{align*}
E \amp \YIELDS E+E\\
\amp \YIELDS x+E\\
\amp \YIELDS x+E*E\\
\amp \YIELDS x+y*E\\
\amp \YIELDS x+y*z
\end{align*}
It shouldn’t be too hard to convince yourself that any string that has a derivation has a left derivation (which can be obtained by changing the order in which production rules are applied).
We have seen that the same string might have several different derivations. We might ask whether it can have several different left derivations. The answer is that it depends on the grammar. A context-free grammar \(G\) is said to be ambiguous if there is a string \(w\in L(G)\) such that \(w\) has more than one left derivation according to the grammar \(G\text{.}\)
Our example grammar \(G_1\) is ambiguous. In fact, in addition to the left derivation given above, the string \(x+y*z\) has the alternative left derivation
\begin{align*}
E \amp \YIELDS E*E\\
\amp \YIELDS E+E*E\\
\amp \YIELDS x+E*E\\
\amp \YIELDS x+y*E\\
\amp \YIELDS x+y*z
\end{align*}
In this left derivation of the string \(x+y*z\text{,}\) the first production rule that is applied is \(E\PRODUCES E*E\text{.}\) The first \(E\) on the right-hand side eventually yields “\(x+y\)” while the second yields “\(z\)”. In the previous left derivation, the first production rule that was applied was \(E\PRODUCES E+E\text{,}\) with the first \(E\) on the right yielding “\(x\)” and the second \(E\) yielding “\(y*z\)”. If we think in terms of arithmetic expressions, the two left derivations lead to two different interpretations of the expression \(x+y*z\text{.}\) In one interpretation, the \(x+y\) is a unit that is multiplied by \(z\text{.}\) In the second interpretation, the \(y*z\) is a unit that is added to \(x\text{.}\) The second interpretation is the one that is correct according to the usual rules of arithmetic. However, the grammar allows either interpretation. The ambiguity of the grammar allows the string to be parsed in two essentially different ways, and only one of the parsings is consistent with the meaning of the string. Of course, the grammar for English is also ambiguous. In a famous example, it’s impossible to tell whether a “pretty girls’ camp” is meant to describe a pretty camp for girls or a camp for pretty girls.
When dealing with artificial languages such as programming languages, it’s better to avoid ambiguity. The grammar \(G_1\) is perfectly correct in that it generates the correct set of strings, but in a practical situation where we are interested in the meaning of the strings, \(G_1\) is not the right grammar for the job. There are other grammars that generate the same language as \(G_1\text{.}\) Some of them are unambiguous grammars that better reflect the meaning of the strings in the language. For example, the language \(L(G_1)\) is also generated by the BNF grammar
\begin{align*}
E \amp ::= T\ [\ +\ T\ ]\dots\\
T \amp ::= F\ [\ *\ F\ ]\dots\\
F \amp ::= \text{“(”}\ E\ \text{“)”}\ \BNFALT\ x\ \BNFALT\ y\ \BNFALT\ z
\end{align*}
This grammar can be translated into a standard context-free grammar, which I will call \(G_2\text{:}\)
\begin{align*}
E \amp \PRODUCES TA\\
A \amp\PRODUCES +TA \\
A \amp \PRODUCES \EMPTYSTRING \\
T \amp \PRODUCES FB\\
B \amp \PRODUCES *FB\\
B \amp \PRODUCES \EMPTYSTRING \\
F \amp \PRODUCES (E)\\
F \amp \PRODUCES x\\
F \amp \PRODUCES y\\
F \amp \PRODUCES z
\end{align*}
The language generated by \(G_2\) consists of all legal arithmetic expressions made up of parentheses, the operators \(+\) and \(*\text{,}\) and the variables \(x\text{,}\) \(y\text{,}\) and \(z\text{.}\) That is, \(L(G_2)=L(G_1)\text{.}\) However, \(G_2\) is an unambiguous grammar. Consider, for example, the string \(x+y*z\text{.}\) Using the grammar \(G_2\text{,}\) the only left derivation for this string is:
\begin{align*}
E \amp \YIELDS TA\\
\amp \YIELDS FBA\\
\amp \YIELDS xBA\\
\amp \YIELDS xA\\
\amp \YIELDS x+TA\\
\amp \YIELDS x+FBA\\
\amp \YIELDS x+yBA\\
\amp \YIELDS x+y*FBA\\
\amp \YIELDS x+y*zBA\\
\amp \YIELDS x+y*zA\\
\amp \YIELDS x+y*z
\end{align*}
There is no choice about the first step in this derivation, since the only production rule with \(E\) on the left-hand side is \(E\PRODUCES TA\text{.}\) Similarly, the second step is forced by the fact that there is only one rule for rewriting a \(T\text{.}\) In the third step, we must replace an \(F\text{.}\) There are four ways to rewrite \(F\text{,}\) but only one way to produce the \(x\) that begins the string \(x+y*z\text{,}\) so we apply the rule \(F\PRODUCES x\text{.}\) Now, we have to decide what to do with the \(B\) in \(xBA\text{.}\) There are two rules for rewriting \(B\text{,}\) \(B\PRODUCES *FB\) and \(B\PRODUCES\EMPTYSTRING\text{.}\) However, the first of these rules introduces a terminal symbol, \(*\text{,}\) which does not match the string we are trying to parse. So, the only choice is to apply the production rule \(B\PRODUCES\EMPTYSTRING\text{.}\) In the next step of the derivation, we must apply the rule \(A\PRODUCES +TA\) in order to account for the \(+\) in the string \(x+y*z\text{.}\) Similarly, each of the remaining steps in the left derivation is forced.
The fact that \(G_2\) is an unambiguous grammar means that at each step in a left derivation for a string \(w\text{,}\) there is only one production rule that can be applied which will lead ultimately to a correct derivation of \(w\text{.}\) However, \(G_2\) actually satisfies a much stronger property: at each step in the left derivation of \(w\text{,}\) we can tell which production rule has to be applied by looking ahead at the next symbol in \(w\text{.}\) We say that \(G_2\) is an LL(1) grammar. (This notation means that we can read a string from Left to right and construct a Left derivation of the string by looking ahead at most 1 character in the string.) Given an LL(1) grammar for a language, it is fairly straightforward to write a computer program that can parse strings in that language. If the language is a programming language, then parsing is one of the essential steps in translating a computer program into machine language. LL(1) grammars and parsing programs that use them are often studied in courses in programming languages and the theory of compilers.
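A parser built from an LL(1) grammar can mirror the grammar rule for rule: each non-terminal becomes a procedure, and the single symbol of lookahead picks which production to apply, just as in the derivation above. Here is a Python sketch of such a recursive-descent parser for \(G_2\) (the class and function names are my own):

class Parser:
    """Recursive-descent (LL(1)) parser for the grammar G2."""
    def __init__(self, s):
        self.s, self.i = s, 0
    def peek(self):
        return self.s[self.i] if self.i < len(self.s) else None
    def eat(self, c):
        if self.peek() != c:
            raise SyntaxError(f"expected {c!r} at position {self.i}")
        self.i += 1
    def E(self):                  # E -> TA  (the only rule for E)
        self.T(); self.A()
    def A(self):                  # lookahead '+' picks A -> +TA, else A -> ε
        if self.peek() == "+":
            self.eat("+"); self.T(); self.A()
    def T(self):                  # T -> FB  (the only rule for T)
        self.F(); self.B()
    def B(self):                  # lookahead '*' picks B -> *FB, else B -> ε
        if self.peek() == "*":
            self.eat("*"); self.F(); self.B()
    def F(self):                  # F -> (E) | x | y | z, chosen by lookahead
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        elif self.peek() in ("x", "y", "z"):
            self.eat(self.peek())
        else:
            raise SyntaxError(f"unexpected symbol at position {self.i}")

def parses(s):
    p = Parser(s)
    try:
        p.E()
        return p.i == len(s)      # the whole string must be consumed
    except SyntaxError:
        return False

assert parses("x+y*z") and parses("((x+y)*y)+z*z")
assert not parses("x+") and not parses("x(y)")

Each if statement corresponds to one of the forced choices discussed above; for example, B applies \(B\PRODUCES *FB\) exactly when the next symbol is \(*\text{,}\) and \(B\PRODUCES\EMPTYSTRING\) otherwise.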
Not every unambiguous context-free grammar is an LL(1) grammar. Consider, for example, the following grammar, which I will call \(G_3\text{:}\)
\begin{align*}
E \amp \PRODUCES E + T\\
E \amp \PRODUCES T\\
T \amp \PRODUCES T * F\\
T \amp \PRODUCES F\\
F \amp \PRODUCES (E)\\
F \amp \PRODUCES x\\
F \amp \PRODUCES y\\
F \amp \PRODUCES z
\end{align*}
This grammar generates the same language as \(G_1\) and \(G_2\text{,}\) and it is unambiguous. However, it is not possible to construct a left derivation for a string according to the grammar \(G_3\) by looking ahead one character in the string at each step. The first step in any left derivation must be either \(E\YIELDS E+T\) or \(E\YIELDS T\text{.}\) But how can we decide which of these is the correct first step? Consider the strings \((x+y)*z\) and \((x+y)*z+z*x\text{,}\) which are both in the language \(L(G_3)\text{.}\) For the string \((x+y)*z\text{,}\) the first step in a left derivation must be \(E\YIELDS T\text{,}\) while the first step in a left derivation of \((x+y)*z+z*x\) must be \(E\YIELDS E+T\text{.}\) However, the first seven characters of the strings are identical, so clearly looking even seven characters ahead is not enough to tell us which production rule to apply. In fact, similar examples show that looking ahead any given finite number of characters is not enough.
However, there is an alternative parsing procedure that will work for \(G_3\text{.}\) This alternative method of parsing a string produces a right derivation of the string, that is, a derivation in which at each step, the non-terminal symbol that is replaced is the rightmost non-terminal symbol in the string. Here, for example, is a right derivation of the string \((x+y)*z\) according to the grammar \(G_3\text{:}\)
\begin{align*}
E \amp \YIELDS T \\
\amp \YIELDS T*F \\
\amp \YIELDS T*z \\
\amp \YIELDS F * z \\
\amp \YIELDS (E) * z \\
\amp \YIELDS (E+T) * z \\
\amp \YIELDS (E+F) * z \\
\amp \YIELDS (E+y) * z \\
\amp \YIELDS (T+y) * z \\
\amp \YIELDS (F+y) * z \\
\amp \YIELDS (x+y) * z
\end{align*}
The parsing method that produces this right derivation produces it from “bottom to top.” That is, it begins with the string \((x+y)*z\) and works backward to the start symbol \(E\text{,}\) generating the steps of the right derivation in reverse order. The method works because \(G_3\) is what is called an LR(1) grammar. That is, roughly, it is possible to read a string from Left to right and produce a Right derivation of the string, by looking ahead at most 1 symbol at each step. Although LL(1) grammars are easier for people to work with, LR(1) grammars turn out to be very suitable for machine processing, and they are used as the basis for the parsing process in many compilers.
LR(1) parsing uses a shift/reduce algorithm. Imagine a cursor or current position that moves through the string that is being parsed. We can visualize the cursor as a vertical bar, so for the string \((x+y)*z\text{,}\) we start with the configuration \(|(x+y)*z\text{.}\) A shift operation simply moves the cursor one symbol to the right. For example, a shift operation would convert \(|(x+y)*z\) to \((|x+y)*z\text{,}\) and a second shift operation would convert that to \((x|+y)*z\text{.}\) In a reduce operation, one or more symbols immediately to the left of the cursor are recognized as the right-hand side of one of the production rules in the grammar. These symbols are removed and replaced by the left-hand side of the production rule. For example, in the configuration \((x|+y)*z\text{,}\) the \(x\) to the left of the cursor is the right-hand side of the production rule \(F\PRODUCES x\text{,}\) so we can apply a reduce operation and replace the \(x\) with \(F\text{,}\) giving \((F|+y)*z\text{.}\) This first reduce operation corresponds to the last step in the right derivation of the string, \((F+y)*z\YIELDS (x+y)*z\text{.}\) Now the \(F\) can be recognized as the right-hand side of the production rule \(T\PRODUCES F\text{,}\) so we can replace the \(F\) with \(T\text{,}\) giving \((T|+y)*z\text{.}\) This corresponds to the next-to-last step in the right derivation, \((T+y)*z\YIELDS (F+y)*z\text{.}\)
At this point, we have the configuration \((T|+y)*z\text{.}\) The \(T\) could be the right-hand side of the production rule \(E\PRODUCES T\text{.}\) However, it could also conceivably come from the rule \(T\PRODUCES T*F\text{.}\) How do we know whether to reduce the \(T\) to \(E\) at this point or to wait for a \(*F\) to come along so that we can reduce \(T*F\,\text{?}\) We can decide by looking ahead at the next character after the cursor. Since this character is a \(+\) rather than a \(*\text{,}\) we should choose the reduce operation that replaces \(T\) with \(E\text{,}\) giving \((E|+y)*z\text{.}\) What makes \(G_3\) an LR(1) grammar is the fact that we can always decide what operation to apply by looking ahead at most one symbol past the cursor.
After a few more shift and reduce operations, the configuration becomes \((E)|*z\text{,}\) which we can reduce to \(T|*z\) by applying the production rules \(F\PRODUCES (E)\) and \(T\PRODUCES F\text{.}\) Now, faced with \(T|*z\text{,}\) we must once again decide between a shift operation and a reduce operation that applies the rule \(E\PRODUCES T\text{.}\) In this case, since the next character is a \(*\) rather than a \(+\text{,}\) we apply the shift operation, giving \(T*|z\text{.}\) From there we get, in succession, \(T*z|\text{,}\) \(T*F|\text{,}\) \(T|\text{,}\) and finally \(E|\text{.}\) At this point, we have reduced the entire string \((x+y)*z\) to the start symbol of the grammar. The very last step, the reduction of \(T\) to \(E\text{,}\) corresponds to the first step of the right derivation, \(E\YIELDS T\text{.}\)
In summary, LR(1) parsing transforms a string into the start symbol of the grammar by a sequence of shift and reduce operations. Each reduce operation corresponds to a step in a right derivation of the string, and these steps are generated in reverse order. Because the steps in the derivation are generated from “bottom to top,” LR(1) parsing is a type of bottom-up parsing. LL(1) parsing, on the other hand, generates the steps in a left derivation from “top to bottom” and so is a type of top-down parsing.
Although the language generated by a context-free grammar is defined in terms of derivations, there is another way of presenting the generation of a string that is often more useful. A parse tree displays the generation of a string from the start symbol of a grammar as a two-dimensional diagram. Here are two parse trees that show two derivations of the string \(x+y*z\) according to the grammar \(G_1\text{,}\) which was given at the beginning of this section:
Figure 17.5.12.
A parse tree is made up of terminal and non-terminal symbols, connected by lines. The start symbol is at the top, or “root,” of the tree. Terminal symbols are at the lowest level, or “leaves,” of the tree. (For some reason, computer scientists traditionally draw trees with leaves at the bottom and root at the top.) A production rule \(A\PRODUCES w\) is represented in a parse tree by the symbol \(A\) lying above all the symbols in \(w\text{,}\) with a line joining \(A\) to each of the symbols in \(w\text{.}\) For example, in the left parse tree above, the root, \(E\text{,}\) is connected to the symbols \(E\text{,}\) \(+\text{,}\) and \(E\text{,}\) and this corresponds to an application of the production rule \(E\PRODUCES E+E\text{.}\)
It is customary to draw a parse tree with the string of terminals in a row across the bottom, and with the rest of the tree built on top of that base. Thus, the two parse trees shown above might be drawn as:
Figure 17.5.13.
Given any derivation of a string, it is possible to construct a parse tree that shows each of the steps in that derivation. However, two different derivations can give rise to the same parse tree, since the parse tree does not show the order in which production rules are applied. For example, the parse tree on the left, above, does not show whether the production rule \(E\PRODUCES x\) is applied before or after the production rule \(E\PRODUCES y\text{.}\) However, if we restrict our attention to left derivations, then we find that each parse tree corresponds to a unique left derivation and vice versa. I will state this fact as a theorem, without proof. A similar result holds for right derivations.
Theorem 17.5.14.
Let \(G\) be a context-free grammar. There is a one-to-one correspondence between parse trees and left derivations based on the grammar \(G\text{.}\)
Based on this theorem, we can say that a context-free grammar \(G\) is ambiguous if and only if there is a string \(w\in L(G)\) which has two parse trees.
Exercises 17.5.4 Exercises
1.
One of the examples in this section was a grammar for a subset of English. Give five more examples of sentences that can be generated from that grammar. Your examples should, collectively, use all the rules of the grammar.
2.
Rewrite the example BNF grammar for a subset of English as a context-free grammar.
3.
Write a single BNF production rule that is equivalent to the following context-free grammar:
\begin{align*}
S \amp \PRODUCES aSa\\
S \amp \PRODUCES bB\\
B \amp \PRODUCES bB \\
B \amp \PRODUCES \EMPTYSTRING
\end{align*}
4.
Write a BNF production rule that specifies the syntax of real numbers, as they appear in programming languages such as Java and C. Real numbers can include a sign, a decimal point and an exponential part. Some examples are: 17.3, .73, 23.1e67, \(-\)1.34E\(-\)12, +0.2, 100E+100
5.
Variable references in the Java programming language can be rather complicated. Some examples include: \(x,\)\(list.next,\)\(A[7],\)\(a.b.c,\)\(S[i+1].grid[r][c].red,\)\(\dots\text{.}\) Write a BNF production rule for Java variables. You can use the token ident and the non-terminal \(\langle expression \rangle\) in your rule.
6.
Use BNF to express the syntax of the try\(\dots\) catch statement in the Java programming language.
7.
Give a BNF grammar for compound propositions made up of propositional variables, parentheses, and the logical operators \(\land\text{,}\)\(\lor\text{,}\) and \(\neg\text{.}\) Use the non-terminal symbol \(\langle pv \rangle\) to represent a propositional variable. You do not have to give a definition of \(\langle pv \rangle\text{.}\)
8.
Show that each of the following grammars is ambiguous by finding a string that has two left derivations according to the grammar:
\begin{align*}
S \amp \PRODUCES SS \\
S \amp \PRODUCES aSb\\
S \amp \PRODUCES bSa \\
S \amp\PRODUCES\EMPTYSTRING
\end{align*}
\begin{align*}
S \amp \PRODUCES ASb \\
S \amp \PRODUCES \EMPTYSTRING\\
A \amp \PRODUCES aA\\
A \amp\PRODUCES a
\end{align*}
9.
Consider the string \(z+(x+y)*x\text{.}\) Find a left derivation of this string according to each of the grammars \(G_1\text{,}\)\(G_2\text{,}\) and \(G_3\text{,}\) as given in this section.
10.
Draw a parse tree for the string \((x+y)*z*x\) according to each of the grammars \(G_1\text{,}\)\(G_2\text{,}\) and \(G_3\text{,}\) as given in this section.
11.
Draw three different parse trees for the string \(ababbaab\) based on the grammar given in part a) of exercise 8.
12.
Suppose that the string \(abbcabac\) has the following parse tree, according to some grammar \(G\text{:}\)
List five production rules that must be rules in the grammar \(G\text{,}\) given that this is a valid parse tree.
Give a left derivation for the string \(abbcabac\) according to the grammar \(G\text{.}\)
Give a right derivation for the string \(abbcabac\) according to the grammar \(G\text{.}\)
13.
Show the full sequence of shift and reduce operations that are used in the LR(1) parsing of the string \(x+(y)*z\) according to the grammar \(G_3\text{,}\) and give the corresponding right derivation of the string.
14.
This section showed how to use LL(1) and LR(1) parsing to find a derivation of a string in the language \(L(G)\) generated by some grammar \(G\text{.}\) How is it possible to use LL(1) or LR(1) parsing to determine for an arbitrary string \(w\) whether \(w\in L(G)\,\text{?}\) Give an example.