COT 6315 - Proof 11.5

COT6315 Formal Languages and Theory of Computation

R. E. Newman-Wolfe, University of Florida

Last modified 2/21/95.

Definition 11.3: Chomsky Normal Form

A grammer G = in CFG is in Chomsky Normal Form (CNF) iff for all A in V and x in P(A), either

x in V^2, or
x in T.

All of the productions are of the form A -> BC or A -> a for A, B, C in V and a in T.

Theorem 11.5: Equivalence of CNF Grammar

G in CFG => Exists G' in CNF such that L(G') = L(G) - {e}.

Proof

Let G1 = Elim-e-Prod(G) and G2 = Elim-Unit-Prod(G1). By Theorems 11.1 and 11.2, L(G2) = L(G1) = L(G) - {e} and G2 has neither unit productions nor e-productions. For any A in V2 and alpha in P2(A), either alpha = a for some a in T, or |alpha| > 1 (since there are no e-productions, |alpha| > 0, and since there are no unit productions, |alpha| = 1 => alpha is in T). For every terminal a appearing in some alpha for which alpha in P2(A) for some A in V2, introduce a new variable X_a:

    V3 = V2 U (X_a | a in T and exists A in V2, alpha, beta in (V2 U T)* with
		alpha.a.beta in P2(A) }.
    FOREACH X_a in V3
	P3(X_a) = {a}
    G" = 
    FOREACH A in V2
        P3(A) = P2(A) - {gamma in P2(A) | |gamma| > 1 and exists a in T, 
		alpha, beta in (V2 U T)* such that gamma = alpha.a.beta} 
		U {gamma' in V3* | gamma' =>*_G" gamma, and gamma in P2(A)}
    G3 =

The last step in building P3 replaces all the long right-hand sides containing terminals with equivalent right-hand sides that contain only variables; each terminal a is replaced with the newly introduced variable X_a, so the old right-hand side is derivable from the new one by applying productions to the X_a's to regenerate the terminals. This makes it obvious that L(G3) = L(G2).

Now we "telescope" each long production by introducing another round of variables that each have only productions with short right-hand sides but allow us to derive the old, long right-hand side from the same variable in a number of steps using the new variables.

    V4 = V3
    FOREACH A in V3
        P4(A) = {a | a in P3(A)} U {BC | BC in P3(A)}
    i = 0
    FOREACH A in V3
	FOREACH alpha = B1B2...Bk in P3(A)
            V4 = V4 U {Y_i+j | j = 1,2,...,k-2}
	    P4(A) = P4(A) U {B1.Y_i+1}
            FOREACH(j, j=i+1 to i+k-3)
		P4(Y_j) = {Bj-i+1.Y_j+1}
	    P4(Y_i+k-2) = {Bk-1.Bk}
            i = i+k-2
    G4 =

Clearly, A =>_G3 B1B2...Bk and k>2 iff A => B1.Y_j => B1.B2.Y_j+1 =>* B1.B2....Bk-2.Y_j+k-2 => B1B2...Bk in G4. Since the shorter productions are preserved intact, L(G4) = L(G3). Thus L(G4) = L(G) - {e}. Letting G' = G4, we have the result.

QED

Send comments to nemo@cis.ufl.edu