Having Fun with Parsers

Imagine you're speaking a language only your computer can understand, and it's hanging on to every word you say, trying to figure out exactly what you mean. This is essentially what a parser does—it reads your code (or any structured input) and tries to make sense of it according to a set of rules. But parsers aren’t just abstract tools; they’re the foundation of compilers, interpreters, and even natural language processing systems. Learning about parsers helps you understand not just how computers process code, but how they make decisions and resolve ambiguities in the languages we use to instruct them.

In this journey, we’ll dive into parsers, starting from an intuitive understanding of how they work, why they’re important, and how to get your hands dirty with actual code that parses a grammar. Along the way, you'll see that parsers aren’t just serious tools—they can be a lot of fun to build and experiment with!

Recursive Descent Parsing

Motivation

At the heart of any programming language is its grammar—a set of rules that define how valid sentences (or expressions) are structured. A recursive descent parser is one of the simplest and most intuitive ways to implement a parser, especially when you’re getting started. It breaks down the parsing process into a series of functions, where each function is responsible for recognizing a specific part of the grammar. The recursive nature allows it to easily handle nested structures, which are common in most programming languages.

What It Does

A recursive descent parser works by reading the input (like code) one piece at a time and trying to match it against the grammar’s rules. Each rule corresponds to a function, and the parser recursively calls these functions based on what it encounters in the input. If the input conforms to the rules, the parser successfully builds a syntax tree. If not, it throws an error.

For example, if you're parsing arithmetic expressions like 3 + (4 * 5), the parser can break this down into smaller parts: numbers, operators, and parentheses, and process them in a recursive fashion.

What It Cannot Do

While recursive descent parsers are easy to understand and implement, they do have limitations. The biggest is that they cannot handle left-recursive grammars. A left-recursive grammar is one where a rule refers to itself as its first step, such as:

Expr -> Expr + Term

This leads to infinite recursion. To use recursive descent, we need to rewrite such rules in a non-left-recursive form. Recursive descent parsers also typically work with LL(1) grammars, meaning they only look one symbol ahead in the input to make decisions.

Example: Arithmetic Grammar

Let’s look at a simple grammar for arithmetic expressions:

Expr    -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term    -> Factor TermPrime
TermPrime -> '*' Factor TermPrime | ε
Factor  -> '(' Expr ')' | NUMBER

This grammar can handle addition (+) and multiplication (*) with parentheses and numbers.

  • Expr represents an expression.

  • Term represents a term in an expression (multiplication part).

  • Factor is either a number or a parenthesized expression.

  • ε represents the empty string, meaning the rule can terminate.

Parsing Example

Let’s parse the expression 3 + 4 * 5 using the above grammar:

  1. Expr starts by calling Term to parse the first part of the expression.

  2. Term calls Factor, which recognizes 3 as a number.

  3. ExprPrime checks for a + after the term and finds it, so it proceeds to call Term again for the next part of the expression.

  4. Term calls Factor again, which now sees 4.

  5. TermPrime looks for a * operator and finds it, so it parses the next factor 5.

  6. After all rules have been satisfied, the parser confirms that the input is valid.

The result is a parsed expression with the correct precedence: 3 + (4 * 5).

Code Outline (Pseudocode)

def parse_expr():
    parse_term()
    parse_expr_prime()

def parse_expr_prime():
    if next_token == '+':
        consume('+')
        parse_term()
        parse_expr_prime()

def parse_term():
    parse_factor()
    parse_term_prime()

def parse_term_prime():
    if next_token == '*':
        consume('*')
        parse_factor()
        parse_term_prime()

def parse_factor():
    if next_token == '(':
        consume('(')
        parse_expr()
        consume(')')
    elif next_token == NUMBER:
        consume(NUMBER)

In this pseudocode, each function corresponds to a rule in the grammar. The parser consumes tokens one by one and recursively applies the rules.

LL(1) Parsing: Predicting with One Symbol of Lookahead

Motivation

While recursive descent parsing is intuitive and works well for many grammars, it can become inefficient or even unusable when grammars grow more complex or contain left-recursion. To address these issues, we can use a more structured approach known as an LL(1) parser. The "LL" stands for Left-to-right parsing and Leftmost derivation, while the "1" means the parser uses one symbol of lookahead to decide which rule to apply. This approach gives us more control over how to parse the input and ensures that we can build efficient and predictable parsers.

LL(1) parsers use a parsing table and avoid recursion, making them suitable for automated parser generation tools (e.g., Yacc, Bison, ANTLR). This makes them a preferred choice for many compilers.

What It Does

An LL(1) parser reads the input one symbol at a time and uses a lookahead symbol to decide which production rule to apply. The parser constructs a parsing table based on the grammar, where each row corresponds to a non-terminal and each column corresponds to a lookahead symbol. The table contains entries that specify which rule to use when a non-terminal is encountered with a certain lookahead.

The key advantage of LL(1) parsers is their predictability and efficiency in parsing top-down. Each decision is made using a single lookahead, so parsing is fast and deterministic.

What It Cannot Do

LL(1) parsers, like recursive descent parsers, cannot handle left-recursive grammars. Moreover, they are restricted to grammars where the correct rule can be determined with only one lookahead symbol. Grammars that require more than one lookahead (LL(k) grammars) or that contain significant ambiguities can’t be parsed by an LL(1) parser.

Example: Arithmetic Grammar for LL(1)

We’ll use a modified version of the arithmetic grammar from the recursive descent parser, rewritten to fit the LL(1) approach:

Expr    -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term    -> Factor TermPrime
TermPrime -> '*' Factor TermPrime | ε
Factor  -> '(' Expr ')' | NUMBER

This grammar defines the structure of basic arithmetic expressions with addition and multiplication. It is not left-recursive and is suitable for LL(1) parsing.

Parsing Table

For the given grammar, the parsing table would look something like this:

Non-Terminal+*()NUMBEREOF
ExprTerm ExprPrimeTerm ExprPrime
ExprPrime+ Term ExprPrimeεε
TermFactor TermPrimeFactor TermPrime
TermPrimeε* Factor TermPrimeεε
Factor( Expr )NUMBER
  • For Expr, when the lookahead is ( or NUMBER, the parser uses the rule Expr -> Term ExprPrime.

  • For ExprPrime, if the lookahead is +, it expands to + Term ExprPrime. If it sees a ) or EOF, it uses the ε production.

  • The other rules similarly follow the grammar’s structure.

Parsing Example

Let’s parse 3 + 4 * 5 using the LL(1) parsing table:

  1. Stack: Initially, the stack has the start symbol Expr, and the lookahead is the first token 3.

  2. The stack has Expr, and using the table for Expr with lookahead NUMBER, the parser expands Expr -> Term ExprPrime.

  3. The stack now contains Term ExprPrime. The parser consults the table for Term with lookahead NUMBER, and expands Term -> Factor TermPrime.

  4. The stack now contains Factor TermPrime. Using the table for Factor with lookahead NUMBER, it expands to NUMBER, and the parser matches the token 3 with NUMBER.

  5. The lookahead now becomes +, and the parser consults the table for ExprPrime with +, expanding ExprPrime -> + Term ExprPrime.

  6. The stack now contains + Term ExprPrime. The parser matches + and advances to the next token, 4.

  7. Repeating this process, the parser continues expanding Term, Factor, and TermPrime according to the lookahead tokens until the entire input is consumed and the stack is empty.

The result is a successful parse of the expression 3 + 4 * 5.

Code Outline (Pseudocode)

def parse_expr():
    parse_term()
    parse_expr_prime()

def parse_expr_prime():
    if lookahead == '+':
        consume('+')
        parse_term()
        parse_expr_prime()
    # If lookahead is ')' or EOF, ε production

def parse_term():
    parse_factor()
    parse_term_prime()

def parse_term_prime():
    if lookahead == '*':
        consume('*')
        parse_factor()
        parse_term_prime()
    # If lookahead is ')' or EOF, ε production

def parse_factor():
    if lookahead == '(':
        consume('(')
        parse_expr()
        consume(')')
    elif lookahead == NUMBER:
        consume(NUMBER)

In this pseudocode, instead of recursion, the parser consults the parsing table at each step and expands based on the current lookahead token. The process is deterministic and ensures that every step is guided by the table.

LR(0) Parsing: Shift-Reduce Parsing Without Lookahead

Motivation

While LL(1) parsers focus on top-down parsing and require a clear decision-making process for each step, LR(0) parsers are a type of bottom-up parser. They work by gradually constructing a parse tree by starting from the input tokens and combining them into higher-level structures (non-terminals) until they reduce the entire input to the start symbol of the grammar.

An LR(0) parser is called a shift-reduce parser because it shifts input symbols onto a stack and then reduces a group of symbols according to the grammar’s production rules. One of the key advantages of LR parsers is their power—they can handle a wider range of grammars than LL parsers, including some ambiguous grammars. However, the simplest form of LR parsing, LR(0), does not use any lookahead to decide which production to apply, making it both powerful but sometimes too eager in its reductions.

What It Does

An LR(0) parser works by maintaining a stack of symbols and states. It alternates between two actions:

  1. Shift: Read the next token from the input and push it onto the stack.

  2. Reduce: Apply a production rule by replacing a group of symbols on the stack with the corresponding non-terminal according to the grammar.

At each step, the parser consults a parsing table to decide whether to shift or reduce. This table is constructed from an LR(0) automaton based on the grammar.

The parser also has an initial state and tracks its current state through a set of transitions. The goto function is used to move between states based on the non-terminal symbols after a reduction.

What It Cannot Do

Because LR(0) parsers do not use any lookahead, they sometimes encounter situations where they cannot decide whether to shift or reduce, leading to shift/reduce conflicts. This limits the number of grammars that LR(0) parsers can handle. To deal with these conflicts, more advanced versions like LR(1) and LALR(1) introduce lookahead symbols to make better decisions.

Additionally, LR(0) parsers struggle with grammars that are not LR(0) grammars, which means they can’t handle ambiguities or cases where multiple reductions could apply without lookahead.

Example: Simple Grammar for LR(0)

Consider the following grammar for arithmetic expressions with addition:

Expr   -> Expr '+' Term
Expr   -> Term
Term   -> NUMBER

This grammar can handle simple expressions like 3 + 4 + 5, where Expr is made up of terms connected by +, and each Term is a number.

LR(0) Automaton

To build an LR(0) parser, we first construct an LR(0) automaton. This automaton consists of items, which are production rules with a dot (•) indicating how much of the rule has been processed. For example, the item:

Expr -> Expr • '+' Term

means that the parser has already seen the Expr part of the rule and is now expecting a '+' followed by Term.

Here are the LR(0) items for our grammar:

(1) Expr -> • Expr '+' Term
(2) Expr -> • Term
(3) Term -> • NUMBER
(4) Expr -> Expr • '+' Term
(5) Expr -> Term •

These items are organized into states, and transitions between the states occur based on the input tokens.

Parsing Table for LR(0)

The parsing table consists of two parts:

  • Action table: Determines whether to shift or reduce based on the current state and the next input symbol.

  • Goto table: Determines which state to transition to after a reduction, based on the non-terminal that was reduced.

StateNUMBER+$ExprTerm
0Shift 3Goto 1Goto 2
1Shift 4Accept
2Reduce 2Reduce 2
3Reduce 3Reduce 3
4Shift 3Goto 5
5Reduce 1Shift 4
  • State 0: If the next token is NUMBER, the parser will shift the token and move to State 3.

  • State 1: If the next token is '+', the parser shifts the token and moves to State 4.

  • State 3: If the input token is NUMBER, the parser reduces using the rule Term -> NUMBER.

  • State 5: Reduce using Expr -> Expr '+' Term.

Parsing Example

Let’s parse the expression 3 + 4 using the LR(0) parser:

  1. Stack: Initially, the stack contains the start state 0, and the lookahead is 3.

  2. Shift: The parser consults the table for State 0 and sees that for NUMBER, it should shift to State 3. The stack becomes 0 NUMBER 3.

  3. Reduce: The parser now consults State 3 and reduces NUMBER to Term. The stack becomes 0 Term 2.

  4. Shift: The next input is +, and in State 1, the table tells the parser to shift to State 4. The stack becomes 0 Expr 1 + 4.

  5. Shift: The next input is 4, so the parser shifts to State 3. The stack becomes 0 Expr 1 + 4 NUMBER 3.

  6. Reduce: The parser reduces NUMBER to Term, and the stack becomes 0 Expr 1 + 4 Term 2.

  7. Reduce: The parser reduces Expr -> Expr + Term, leading to the final reduction. The input is successfully parsed as Expr.

Code Outline (Pseudocode)

Here’s a simplified pseudocode structure for an LR(0) parser:

def parse():
    stack = [0]  # Start with the initial state
    while True:
        state = stack[-1]
        lookahead = get_next_token()

        action = action_table[state][lookahead]

        if action == 'shift':
            stack.append(lookahead)
            stack.append(new_state)
        elif action == 'reduce':
            # Pop symbols corresponding to the production
            stack = stack[:-2 * length_of_production]
            non_terminal = lhs_of_production
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"

The action table determines whether to shift or reduce, and the goto table determines how to transition after reductions.

Canonical LR(1) Parsing: Powerful Parsing with Lookahead

Motivation

While LR(0) parsers are effective for simple grammars, they encounter conflicts in more complex situations, especially when the parser has to decide between shifting or reducing without enough information. Canonical LR(1) parsers address this by introducing a lookahead of 1 token to guide decisions. The “1” in LR(1) signifies that the parser uses one token of lookahead to choose between conflicting parsing actions, allowing it to handle a much broader range of grammars, including those with ambiguities that LR(0) parsers cannot resolve.

Canonical LR(1) parsers are extremely powerful and can handle all context-free grammars. This makes them an excellent choice for building compilers that require robust parsing capabilities. However, they do come with the cost of more complex parsing tables and larger automata, which is addressed in LALR parsers (a more efficient variant).

What It Does

An LR(1) parser uses items (just like in LR(0)), but with an added lookahead symbol to help make decisions during parsing. This lookahead is included in the construction of the LR(1) automaton, and it enables the parser to differentiate between similar grammar rules by checking what comes next in the input. The parser still operates as a shift-reduce parser, but now it consults both the current state and the lookahead symbol to decide whether to shift, reduce, or accept.

The key difference between LR(1) and LR(0) is the extra power granted by the lookahead, which eliminates many of the conflicts that arise in LR(0) parsing.

What It Cannot Do

Though Canonical LR(1) parsers are extremely powerful, they are often seen as impractical for large grammars because of the large size of the parsing tables. The sheer number of states and transitions in the automaton can be overwhelming. This is why in practical applications, LALR(1) parsers are often used instead, as they combine the power of LR(1) with the efficiency of smaller tables.

Example: Simple Grammar for LR(1)

Consider the following grammar for arithmetic expressions with addition and parentheses:

Expr   -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term   -> '(' Expr ')' | NUMBER

This grammar can handle simple expressions with numbers, addition, and parentheses.

LR(1) Items

To build an LR(1) parser, we start with LR(1) items, which look like LR(0) items but include a lookahead. For example, an LR(1) item might look like this:

Expr -> Term • ExprPrime, {+, $}

This item indicates that we are in the middle of parsing the rule Expr -> Term ExprPrime, and our lookahead symbol is either '+' or the end of input ($). The lookahead helps the parser decide whether to continue parsing ExprPrime or reduce.

Here are some of the LR(1) items for the grammar:

(1) Expr -> • Term ExprPrime, {$}
(2) ExprPrime -> • '+' Term ExprPrime, {$}
(3) Term -> • '(' Expr ')', {+, $}
(4) Term -> • NUMBER, {+, $}
(5) Expr -> Term • ExprPrime, {$}
(6) ExprPrime -> '+' • Term ExprPrime, {$}

LR(1) Automaton

The LR(1) automaton consists of states constructed from the LR(1) items. Each state represents a set of items, and transitions between states occur based on the next symbol in the input and the lookahead symbol.

Parsing Table for LR(1)

The parsing table for an LR(1) parser is similar to that of LR(0), but it uses the lookahead symbol to make decisions. Here's an example parsing table for the above grammar:

State(NUMBER+)$ExprExprPrimeTerm
0Shift 3Shift 4Goto 1Goto 2Goto 5
1Accept
2Shift 6Reduce 2Reduce 2
3Shift 3Shift 4Goto 8Goto 9Goto 5
4Reduce 4Reduce 4Reduce 4Reduce 4
5Shift 6
6Shift 3Shift 4Goto 2Goto 5
  • State 0: If the next token is ( or NUMBER, the parser will shift to State 3 or State 4 respectively.

  • State 1: The input is accepted if $ is the lookahead.

  • State 2: Depending on the lookahead, the parser may reduce using ExprPrime -> ε (if the next token is ) or $).

  • State 6: If the lookahead is '+', the parser will shift and continue processing ExprPrime.

Parsing Example

Let’s parse the expression (3 + 4) using the LR(1) parser:

  1. Stack: Initially, the stack contains the start state 0, and the lookahead is '('.

  2. Shift: The parser consults the table for State 0 and shifts ( to State 3.

  3. Shift: In State 3, the next token 3 is shifted to State 4.

  4. Reduce: In State 4, the parser reduces NUMBER to Term, and the stack becomes 0 Term 5.

  5. Shift: In State 5, the parser shifts + to State 6.

  6. Shift: In State 6, the next token 4 is shifted to State 4.

  7. Reduce: The parser reduces NUMBER to Term again, and the stack becomes 0 Term + Term.

  8. Reduce: The parser reduces ExprPrime -> '+' Term, resulting in Expr.

  9. Shift: Finally, the parser shifts ) and successfully reduces the input.

Code Outline (Pseudocode)

Here’s a simplified pseudocode structure for an LR(1) parser:

def parse():
    stack = [0]  # Start with the initial state
    lookahead = get_next_token()

    while True:
        state = stack[-1]
        action = action_table[state][lookahead]

        if action == 'shift':
            stack.append(lookahead)
            stack.append(new_state)
            lookahead = get_next_token()
        elif action == 'reduce':
            # Pop symbols according to the production's length
            production = get_production(action)
            stack = stack[:-2 * len(production.rhs)]
            non_terminal = production.lhs
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"

The action table determines whether to shift or reduce, while the goto table directs the parser to the correct state after a reduction.

LALR(1) Parsing: Practical and Efficient LR Parsing

Motivation

While Canonical LR(1) parsers are extremely powerful and capable of handling any context-free grammar, they suffer from a practical limitation: their parsing tables tend to be quite large. As the number of states in the automaton increases with the complexity of the grammar, so does the memory requirement for the parsing table. LALR(1) (Look-Ahead LR) parsers provide a clever compromise between the power of LR(1) parsers and the efficiency needed for real-world usage.

LALR(1) parsers work by merging similar states in the LR(1) automaton, significantly reducing the size of the parsing table, while still using one token of lookahead. This makes LALR(1) parsers practical for use in compilers (like those generated by tools such as Yacc or Bison), which is why many programming languages rely on them.

What It Does

An LALR(1) parser is constructed by merging certain states in the LR(1) automaton that have the same core but differ only in their lookahead sets. This reduces the number of states without losing the ability to handle the same class of grammars as LR(1). The merging process ensures that the resulting parser is still able to handle the correct lookahead in any situation, but with a smaller table size.

In short, LALR(1) parsers have the same parsing power as LR(1) but require far fewer states and transitions. This reduction in table size makes them a practical choice for compiler construction.

What It Cannot Do

LALR(1) parsers are a subset of LR(1) parsers, meaning they can handle the same class of grammars. However, because states are merged, LALR(1) parsers can sometimes encounter shift/reduce or reduce/reduce conflicts that would not exist in a canonical LR(1) parser. In practice, most grammars used in programming languages can be rewritten or adapted to avoid these conflicts, making LALR(1) parsers highly useful.

Example: Arithmetic Grammar for LALR(1)

We’ll use the same arithmetic grammar we used for the LR(1) example:

Expr    -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term    -> '(' Expr ')' | NUMBER

This grammar supports simple arithmetic expressions with numbers, addition, and parentheses.

Merging LR(1) States

To construct an LALR(1) parser, we first generate the LR(1) automaton, then look for states that have the same core items but differ in their lookahead sets. For example, consider two LR(1) items that share the same core but differ in lookahead:

Expr -> Term • ExprPrime, {+, $}
Expr -> Term • ExprPrime, {)}

In the Canonical LR(1) automaton, these would be two separate states. However, since their cores (the part before the lookahead) are identical, LALR(1) parsers merge these states into one, combining the lookahead sets {+, $, )}.

By merging such states, the number of states and the overall size of the parsing table is reduced.

Parsing Table for LALR(1)

Here is an example LALR(1) parsing table for our grammar:

State(NUMBER+)$ExprExprPrimeTerm
0Shift 3Shift 4Goto 1Goto 2Goto 5
1Accept
2Shift 6Reduce 2Reduce 2
3Shift 3Shift 4Goto 8Goto 9Goto 5
4Reduce 4Reduce 4Reduce 4Reduce 4
5Shift 6
6Shift 3Shift 4Goto 2Goto 5
  • State 0: If the next token is ( or NUMBER, the parser shifts to State 3 or State 4.

  • State 1: If the lookahead is $, the input is accepted.

  • State 2: Depending on the lookahead, the parser either reduces ExprPrime -> ε or shifts + to State 6.

  • State 6: The parser shifts +, allowing for another addition operation.

Parsing Example

Let’s parse the expression 3 + 4 using the LALR(1) parser:

  1. Stack: Initially, the stack contains the start state 0, and the lookahead is 3.

  2. Shift: The parser shifts 3 to State 4.

  3. Reduce: In State 4, the parser reduces NUMBER to Term. The stack becomes 0 Term 5.

  4. Shift: The parser shifts + to State 6.

  5. Shift: The next token 4 is shifted to State 4.

  6. Reduce: The parser reduces NUMBER to Term again, and the stack becomes 0 Term + Term.

  7. Reduce: The parser reduces ExprPrime -> '+' Term, combining the terms.

  8. Shift: Finally, the parser reduces the input to Expr, completing the parse.

Code Outline (Pseudocode)

Here’s a simplified pseudocode for an LALR(1) parser:

def parse():
    stack = [0]  # Start with the initial state
    lookahead = get_next_token()

    while True:
        state = stack[-1]
        action = action_table[state][lookahead]

        if action == 'shift':
            stack.append(lookahead)
            stack.append(new_state)
            lookahead = get_next_token()
        elif action == 'reduce':
            production = get_production(action)
            stack = stack[:-2 * len(production.rhs)]
            non_terminal = production.lhs
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"

In this pseudocode, the parser uses the action and goto tables to determine whether to shift, reduce, or accept based on the current state and lookahead token.

Exercises

  1. Recursive Descent Parsing Exercise

Objective: Understand the working of recursive descent parsers by manually writing a parser for a simple grammar.

Task:

  • Given the following grammar for simple Boolean expressions, write a recursive descent parser in pseudocode or a programming language of your choice.
Expr   -> Term ExprPrime
ExprPrime -> 'OR' Term ExprPrime | ε
Term   -> Factor TermPrime
TermPrime -> 'AND' Factor TermPrime | ε
Factor  -> 'NOT' Factor | '(' Expr ')' | ID
  • Parse the following expression using your recursive descent parser: (ID AND NOT ID) OR ID

    Questions:

    • What are the recursive functions for each non-terminal?

    • How does the parser handle precedence between AND and OR?

    • What happens if you introduce left recursion in this grammar? Rewrite a left-recursive rule in non-left-recursive form.

  1. LL(1) Parsing Table Construction

Objective: Develop skills to construct LL(1) parsing tables and apply them in parsing.

Task:

  • Given the grammar below, construct an LL(1) parsing table:
S  -> A B
A  -> 'a' A | ε
B  -> 'b' B | ε
  • Create the FIRST and FOLLOW sets for each non-terminal.

  • Construct the LL(1) parsing table.

  • Manually simulate parsing the string "aabb" using the table.

    Questions:

    • What challenges did you face while constructing the parsing table?

    • Can this grammar be parsed by an LL(1) parser without modification?

    • Why is this grammar suitable for LL(1) parsing?

  1. LR(0) Shift-Reduce Parsing

Objective: Gain experience with shift-reduce parsing and the LR(0) parsing mechanism.

Task:

  • For the following grammar, construct the LR(0) items and the LR(0) parsing table:
S -> L '=' R
S -> R
L -> '*' R
L -> 'ID'
R -> L
  • Parse the input string ID = *ID using the LR(0) parsing table and show the contents of the stack at each step.

    Questions:

    • At what points does the parser perform a shift action, and when does it reduce?

    • What is the main limitation of the LR(0) parser for this grammar? Are there any conflicts? How would LR(1) resolve them?

  1. Canonical LR(1) Parsing Table Construction

Objective: Understand and apply Canonical LR(1) parsing techniques.

Task:

  • Given the following grammar, construct the LR(1) items and the Canonical LR(1) parsing table:
S  -> A 'b'
A  -> 'c' A | 'd'
  • Use your table to parse the input string "d b" and "c d b".

    Questions:

    • How does the lookahead help resolve conflicts that would arise in LR(0)?

    • What is the size of the LR(1) parsing table compared to LR(0)? Can you explain the difference?

    • What kind of conflicts does the lookahead resolve that LR(0) could not?

  1. LALR(1) Parser State Merging

Objective: Gain experience with merging states in LALR(1) parsers and understanding its effect.

Task:

  • Construct the LR(1) items for the grammar below, then merge the states to create the LALR(1) automaton and parsing table:
S  -> 'a' A 'c' | 'b' A 'd'
A  -> 'e'
  • Compare the size of the Canonical LR(1) and LALR(1) tables for this grammar.

    Questions:

    • Which states were merged, and why?

    • Does the LALR(1) parser still function correctly for this grammar, or does merging states introduce any new conflicts?

    • Discuss the practical advantages of LALR(1) over Canonical LR(1) in real-world compiler design.

  1. General Challenge: Grammar Classification

Task: For each of the following grammars, classify whether it is suitable for parsing using Recursive Descent, LL(1), LR(0), Canonical LR(1), or LALR(1) parsers, and justify your choice:

  1.  S -> 'a' S 'b' | ε
    
  2.  S -> A 'c'
     A -> 'a' A | 'b'
    
  3.  S -> A A
     A -> 'a' | ε
    

Questions:

  • For each grammar, indicate whether it can be parsed by a Recursive Descent parser. If not, how would you modify the grammar to make it parsable by Recursive Descent?

  • Identify if each grammar is LL(1). If not, is it because of ambiguity or left recursion?

  • For each grammar, determine if it can be parsed by an LR(0) parser or if it requires lookahead.

Introduction to Parser Generators

Parser generators are tools that automatically generate parsers from a given grammar. Instead of manually implementing parsing functions for each grammar rule, you provide a description of the grammar to the tool, and it generates the necessary code to parse inputs according to that grammar. This approach saves time, reduces errors, and is especially helpful for large or complex grammars.

There are several widely-used parser generators available:

  1. ANTLR (Another Tool for Language Recognition): One of the most popular parser generators, supporting a wide range of languages. ANTLR generates parsers in multiple programming languages, such as Java, Python, C#, and JavaScript.

  2. Bison: The GNU version of Yacc (Yet Another Compiler Compiler), a classic parser generator used primarily with C or C++.

  3. JavaCC (Java Compiler Compiler): Specifically designed for Java, this tool is used to generate parsers in the Java language.

  4. Yacc: The original Unix parser generator, often used with Lex (a lexical analyzer generator). Yacc is similar to Bison but works with C.

In this lesson, we’ll focus on ANTLR because of its modern design, rich features, support for multiple languages, and ease of use.

Writing a Simplified C Grammar and Parser Using ANTLR

Step 1: Install ANTLR

Before starting, you need to install ANTLR. Follow these steps:

  1. Download the latest ANTLR jar from ANTLR’s official website.

  2. Set up a project in your preferred language (Java, Python, C#, etc.) that can integrate ANTLR.

  3. Add ANTLR’s jar file to your project dependencies and install the ANTLR runtime library for your chosen language.

For example, in a Java project, you can use Maven with this dependency:

<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
    <version>4.9.2</version>
</dependency>

Once you have the setup ready, you can start writing the grammar.

Step 2: Create a Simplified C Grammar

Here’s a simplified grammar that can parse basic C-like constructs such as variable declarations, if-else statements, and function calls:

grammar SimpleC;

// Parser Rules
prog:   stat+ ;

stat:   decl ';'
    |   assign ';'
    |   if_stat
    |   func_call ';'
    ;

decl:   'int' ID ('=' expr)? ;
assign: ID '=' expr ;
if_stat: 'if' '(' expr ')' stat ('else' stat)? ;
func_call: ID '(' (expr (',' expr)*)? ')' ;

expr:   expr ('+' | '-') expr 
    |   expr ('*' | '/') expr 
    |   '(' expr ')' 
    |   NUMBER 
    |   ID
    ;

// Lexer Rules
ID: [a-zA-Z_][a-zA-Z_0-9]* ;
NUMBER: [0-9]+ ;
WS: [ \t\n\r]+ -> skip ;

Explanation:

  • prog: The root rule, which expects a sequence of statements (stat+).

  • stat: Can be a variable declaration (decl), an assignment (assign), an if statement (if_stat), or a function call (func_call).

  • decl: Handles variable declarations like int x = 10;.

  • assign: Represents an assignment like x = 5;.

  • if_stat: Represents if statements, optionally followed by an else block.

  • func_call: Handles simple function calls, e.g., foo(3, 4);.

  • expr: Represents expressions, supporting arithmetic operations and parentheses.

  • ID and NUMBER are lexer rules to identify variable names and numeric values.

  • WS is the whitespace rule, which skips spaces, tabs, and newlines.

Step 3: Generate the Parser

Now that we have the grammar, we need to generate the parser using ANTLR.

  1. Run ANTLR to generate the parser files from the grammar. If you're working in a command-line environment, you can do this with:
java -jar antlr-4.x-complete.jar SimpleC.g4

This will generate several files, including:

  • A Lexer to break the input into tokens.

  • A Parser to apply the rules.

  • Supporting classes and data structures (like the parse tree).

  1. In an IDE or a build system like Maven, you can include the generated files and ANTLR runtime in your project.

Step 4: Write Code to Use the Parser

Now that the parser and lexer have been generated, let's write code to use them to parse a simple C-like program. Here’s an example in Java.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class SimpleCParserDemo {
    public static void main(String[] args) throws Exception {
        // Sample input C-like program
        String program = "int x = 5; x = x + 2; if (x > 0) { foo(x); }";

        // Step 1: Create a CharStream to read the input
        CharStream input = CharStreams.fromString(program);

        // Step 2: Create a lexer that feeds off of the input CharStream
        SimpleCLexer lexer = new SimpleCLexer(input);

        // Step 3: Create a buffer of tokens
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // Step 4: Create a parser that feeds off the tokens buffer
        SimpleCParser parser = new SimpleCParser(tokens);

        // Step 5: Begin parsing at the 'prog' rule
        ParseTree tree = parser.prog();

        // Print the parse tree (LISP-style) for debugging
        System.out.println(tree.toStringTree(parser));
    }
}

Explanation:

  • CharStreams.fromString(program): Reads the C-like input program as a character stream.

  • SimpleCLexer: The lexer generated by ANTLR from our grammar, which breaks the input into tokens.

  • CommonTokenStream: Buffers the tokens produced by the lexer.

  • SimpleCParser: The parser generated by ANTLR from our grammar, which processes the token stream based on the grammar.

  • ParseTree tree = parser.prog(): Starts parsing the input from the prog rule and generates a parse tree.

  • toStringTree(parser): Prints the parse tree in a LISP-like format for easy debugging.

Step 5: Running the Parser

When you run the Java code with the sample input "int x = 5; x = x + 2; if (x > 0) { foo(x); }", the output will be a parse tree, showing how the input is structured according to the grammar.

Sample output (simplified for readability):

(prog (decl int x = (expr 5)) ; 
      (assign x = (expr (expr x) + (expr 2))) ; 
      (if_stat if ( (expr (expr x) > (expr 0)) ) (stat (func_call foo ( (expr x) )))))

This output shows that the program has been successfully parsed:

  • The first line declares the variable x.

  • The second line assigns x + 2 to x.

  • The third line parses an if statement, calling the function foo(x).

Why This Matters

  • Lexer and Parser Separation: ANTLR separates the lexer and parser, making it easier to handle different levels of grammar processing. The lexer turns input into tokens, while the parser handles the structure and meaning of those tokens.

  • Tree Structure: The parse tree produced by ANTLR gives you an explicit representation of the input structure, which you can traverse or manipulate for tasks like semantic analysis or code generation.

  • Error Handling: ANTLR automatically includes robust error handling and reporting in the generated parser. If the input doesn't conform to the grammar, ANTLR can provide informative error messages.

Exercises

  1. Modify the Grammar:

    • Add support for while loops and variable types like float and char.

    • Add return statements for functions.

  2. Generate Parse Tree:

    • Extend the provided Java code to visualize the parse tree using an external library (e.g., org.antlr.v4.gui.TreeViewer).
  3. Error Recovery:

    • Introduce syntax errors in the input (e.g., missing semicolons or parentheses) and observe how ANTLR handles them. Modify the grammar to improve error reporting.
  4. Semantic Analysis:

    • After building the parse tree, implement a pass that checks for undeclared variables or incorrect type usage.