# Having Fun with Parsers

Imagine you're speaking a language only your computer can understand, and it's hanging on to every word you say, trying to figure out exactly what you mean. This is essentially what a **parser** does—it reads your code (or any structured input) and tries to make sense of it according to a set of rules. But parsers aren’t just abstract tools; they’re the foundation of compilers, interpreters, and even natural language processing systems. Learning about parsers helps you understand not just how computers process code, but how they make decisions and resolve ambiguities in the languages we use to instruct them.

In this journey, we’ll dive into **parsers**, starting from an intuitive understanding of how they work, why they’re important, and how to get your hands dirty with actual code that parses a grammar. Along the way, you'll see that parsers aren’t just serious tools—they can be a lot of fun to build and experiment with!

### Recursive Descent Parsing

#### Motivation

At the heart of any programming language is its **grammar**—a set of rules that define how valid sentences (or expressions) are structured. A **recursive descent parser** is one of the simplest and most intuitive ways to implement a parser, especially when you’re getting started. It breaks down the parsing process into a series of functions, where each function is responsible for recognizing a specific part of the grammar. The recursive nature allows it to easily handle nested structures, which are common in most programming languages.

#### What It Does

A recursive descent parser works by reading the input (like code) one piece at a time and trying to match it against the grammar’s rules. Each rule corresponds to a function, and the parser recursively calls these functions based on what it encounters in the input. If the input conforms to the rules, the parser successfully builds a syntax tree. If not, it throws an error.

For example, when parsing an arithmetic expression like `3 + (4 * 5)`, the parser breaks the input into smaller parts (numbers, operators, and parentheses) and processes them recursively.

#### What It Cannot Do

While recursive descent parsers are easy to understand and implement, they do have limitations. The biggest is that they cannot handle **left-recursive grammars**. A left-recursive grammar is one where a rule refers to itself as its first step, such as:

```
Expr -> Expr + Term
```

This leads to infinite recursion. To use recursive descent, we need to rewrite such rules in a non-left-recursive form. Recursive descent parsers also typically work with **LL(1) grammars**, meaning they only look one symbol ahead in the input to make decisions.
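The standard remedy is **left-recursion elimination**: a rule of the form `A -> A α | β` is rewritten with a helper non-terminal so that the recursion happens on the right instead:

```
A -> β A'
A' -> α A' | ε
```

Applied to the rule above, `Expr -> Expr '+' Term | Term` becomes `Expr -> Term ExprPrime` with `ExprPrime -> '+' Term ExprPrime | ε`, which is exactly the shape used in the example grammar below.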

#### Example: Arithmetic Grammar

Let’s look at a simple grammar for arithmetic expressions:

```
Expr -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term -> Factor TermPrime
TermPrime -> '*' Factor TermPrime | ε
Factor -> '(' Expr ')' | NUMBER
```

This grammar handles addition (`+`) and multiplication (`*`), along with parentheses and numbers.

- **Expr** represents an expression.
- **Term** represents a term in an expression (the multiplication part).
- **Factor** is either a number or a parenthesized expression.
- **ε** represents the empty string, meaning the rule can terminate.

#### Parsing Example

Let’s parse the expression `3 + 4 * 5` using the above grammar:

1. **Expr** starts by calling **Term** to parse the first part of the expression.
2. **Term** calls **Factor**, which recognizes `3` as a number.
3. **ExprPrime** checks for a `+` after the term and finds it, so it calls **Term** again for the next part of the expression.
4. **Term** calls **Factor** again, which now sees `4`.
5. **TermPrime** looks for a `*` operator and finds it, so it parses the next factor, `5`.
6. After all rules have been satisfied, the parser confirms that the input is valid.

The result is a parsed expression with the correct precedence: `3 + (4 * 5)`.

#### Code Outline (Pseudocode)

```
def parse_expr():
    parse_term()
    parse_expr_prime()

def parse_expr_prime():
    if next_token == '+':
        consume('+')
        parse_term()
        parse_expr_prime()
    # Otherwise, take the ε production.

def parse_term():
    parse_factor()
    parse_term_prime()

def parse_term_prime():
    if next_token == '*':
        consume('*')
        parse_factor()
        parse_term_prime()
    # Otherwise, take the ε production.

def parse_factor():
    if next_token == '(':
        consume('(')
        parse_expr()
        consume(')')
    elif next_token == NUMBER:
        consume(NUMBER)
    else:
        error("expected '(' or NUMBER")
```

In this pseudocode, each function corresponds to a rule in the grammar. The parser consumes tokens one by one and recursively applies the rules.
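To make the outline concrete, here is a runnable Python version of the same parser. The tokenizer and the choice to evaluate the expression (rather than build a syntax tree) are our own additions; note how each ε-production becomes the exit condition of a `while` loop:

```python
def tokenize(text):
    """Split an arithmetic expression into (kind, text) tokens."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(('NUMBER', text[i:j]))
            i = j
        elif ch in '+*()':
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f'unexpected character {ch!r}')
    tokens.append(('EOF', ''))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0]

    def consume(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f'expected {kind}, got {self.peek()}')
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    # Expr -> Term ExprPrime
    def parse_expr(self):
        value = self.parse_term()
        while self.peek() == '+':     # ExprPrime -> '+' Term ExprPrime | ε
            self.consume('+')
            value += self.parse_term()
        return value

    # Term -> Factor TermPrime
    def parse_term(self):
        value = self.parse_factor()
        while self.peek() == '*':     # TermPrime -> '*' Factor TermPrime | ε
            self.consume('*')
            value *= self.parse_factor()
        return value

    # Factor -> '(' Expr ')' | NUMBER
    def parse_factor(self):
        if self.peek() == '(':
            self.consume('(')
            value = self.parse_expr()
            self.consume(')')
            return value
        return int(self.consume('NUMBER')[1])
```

For example, `Parser(tokenize('3 + 4 * 5')).parse_expr()` evaluates the expression with the correct precedence.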

### LL(1) Parsing: Predicting with One Symbol of Lookahead

#### Motivation

While recursive descent parsing is intuitive and works well for many grammars, scattering the parsing decisions across hand-written functions makes it hard to verify that every decision really can be made correctly. A more structured approach is the **LL(1) parser**. The "LL" stands for **Left-to-right** scanning of the input and **Leftmost derivation**, while the "1" means the parser uses **one symbol of lookahead** to decide which rule to apply. This approach gives us more control over how the input is parsed and lets us build efficient, predictable parsers.

LL(1) parsers use an explicit **parsing table** and can avoid recursion entirely, which makes them well suited to automated parser generation (top-down generators such as ANTLR and JavaCC work this way; Yacc and Bison, by contrast, generate bottom-up parsers, covered later in this text).

#### What It Does

An LL(1) parser reads the input one symbol at a time and uses a **lookahead** symbol to decide which production rule to apply. The parser constructs a **parsing table** based on the grammar, where each row corresponds to a non-terminal and each column corresponds to a lookahead symbol. The table contains entries that specify which rule to use when a non-terminal is encountered with a certain lookahead.

The key advantage of LL(1) parsers is their predictability and efficiency in parsing top-down. Each decision is made using a single lookahead, so parsing is fast and deterministic.

#### What It Cannot Do

LL(1) parsers, like recursive descent parsers, cannot handle **left-recursive grammars**. Moreover, they are restricted to grammars where the correct rule can be determined with only one lookahead symbol. Grammars that require more than one lookahead (LL(k) grammars) or that contain significant ambiguities can’t be parsed by an LL(1) parser.

#### Example: Arithmetic Grammar for LL(1)

We’ll use a modified version of the arithmetic grammar from the recursive descent parser, rewritten to fit the LL(1) approach:

```
Expr -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term -> Factor TermPrime
TermPrime -> '*' Factor TermPrime | ε
Factor -> '(' Expr ')' | NUMBER
```

This grammar defines the structure of basic arithmetic expressions with addition and multiplication. It is **not** left-recursive and is suitable for LL(1) parsing.
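LL(1) parsing decisions are driven by the grammar's **FIRST** (and **FOLLOW**) sets. As a rough sketch of how FIRST sets can be computed, here is a small fixed-point iteration in Python over this grammar; the dictionary encoding of the grammar is our own illustrative choice:

```python
# Compute FIRST sets for the arithmetic grammar by fixed-point iteration.
EPS = 'ε'
grammar = {
    'Expr':      [['Term', 'ExprPrime']],
    'ExprPrime': [['+', 'Term', 'ExprPrime'], [EPS]],
    'Term':      [['Factor', 'TermPrime']],
    'TermPrime': [['*', 'Factor', 'TermPrime'], [EPS]],
    'Factor':    [['(', 'Expr', ')'], ['NUMBER']],
}
first = {nt: set() for nt in grammar}

def first_of(symbol):
    # A terminal's FIRST set is just the terminal itself.
    return first[symbol] if symbol in grammar else {symbol}

changed = True
while changed:          # repeat until no FIRST set grows
    changed = False
    for nt, productions in grammar.items():
        for prod in productions:
            for sym in prod:
                f = first_of(sym)
                new = (f - {EPS}) - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
                if EPS not in f:      # sym is not nullable: stop here
                    break
            else:                     # every symbol was nullable
                if EPS not in first[nt]:
                    first[nt].add(EPS)
                    changed = True
```

Running this yields, for instance, `FIRST(Expr) = {'(', 'NUMBER'}` and `FIRST(ExprPrime) = {'+', ε}`, which is exactly the information the parsing table in the next subsection is built from.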

#### Parsing Table

For the given grammar, the parsing table looks like this (blank cells are syntax errors):

| Non-Terminal | `+` | `*` | `(` | `)` | `NUMBER` | `EOF` |
|---|---|---|---|---|---|---|
| Expr | | | `Term ExprPrime` | | `Term ExprPrime` | |
| ExprPrime | `+ Term ExprPrime` | | | `ε` | | `ε` |
| Term | | | `Factor TermPrime` | | `Factor TermPrime` | |
| TermPrime | `ε` | `* Factor TermPrime` | | `ε` | | `ε` |
| Factor | | | `( Expr )` | | `NUMBER` | |

- For `Expr`, when the lookahead is `(` or `NUMBER`, the parser uses the rule `Expr -> Term ExprPrime`.
- For `ExprPrime`, if the lookahead is `+`, it expands to `+ Term ExprPrime`. If it sees a `)` or `EOF`, it uses the ε production.
- The other rules similarly follow the grammar’s structure.

#### Parsing Example

Let’s parse `3 + 4 * 5` using the LL(1) parsing table:

1. **Stack**: Initially, the stack holds the start symbol `Expr`, and the lookahead is the first token, `3` (a `NUMBER`).
2. The table entry for `Expr` with lookahead `NUMBER` expands `Expr -> Term ExprPrime`; the stack becomes `Term ExprPrime`.
3. The entry for `Term` with lookahead `NUMBER` expands `Term -> Factor TermPrime`; the stack becomes `Factor TermPrime ExprPrime`.
4. The entry for `Factor` with lookahead `NUMBER` expands to `NUMBER`, and the parser matches the token `3`. The lookahead becomes `+`.
5. The entry for `TermPrime` with lookahead `+` is the ε production, so `TermPrime` is simply popped.
6. The entry for `ExprPrime` with lookahead `+` expands `ExprPrime -> + Term ExprPrime`. The parser matches `+` and advances to the next token, `4`.
7. Repeating this process, the parser continues expanding `Term`, `Factor`, and `TermPrime` according to the lookahead tokens until the entire input is consumed and the stack is empty.

The result is a successful parse of the expression `3 + 4 * 5`.

#### Code Outline (Pseudocode)

```
def parse_expr():
    parse_term()
    parse_expr_prime()

def parse_expr_prime():
    if lookahead == '+':
        consume('+')
        parse_term()
        parse_expr_prime()
    # If lookahead is ')' or EOF, take the ε production.

def parse_term():
    parse_factor()
    parse_term_prime()

def parse_term_prime():
    if lookahead == '*':
        consume('*')
        parse_factor()
        parse_term_prime()
    # If lookahead is '+', ')' or EOF, take the ε production.

def parse_factor():
    if lookahead == '(':
        consume('(')
        parse_expr()
        consume(')')
    elif lookahead == NUMBER:
        consume(NUMBER)
```

This pseudocode is still written recursively, but every branch corresponds exactly to an entry in the parsing table: each `if` test on the lookahead is a table lookup. A table-driven implementation replaces the recursion with an explicit stack and consults the table at each step; either way, the process is deterministic because every decision is guided by a single lookahead token.
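To make the table-driven variant concrete, here is a small runnable Python sketch that uses an explicit stack in place of recursion; the tuple-keyed dictionary encoding of the parsing table is our own illustrative choice:

```python
# LL(1) parsing table: (non-terminal, lookahead) -> right-hand side.
# An empty list encodes the ε production; a missing entry is a syntax error.
table = {
    ('Expr', '('): ['Term', 'ExprPrime'],
    ('Expr', 'NUMBER'): ['Term', 'ExprPrime'],
    ('ExprPrime', '+'): ['+', 'Term', 'ExprPrime'],
    ('ExprPrime', ')'): [],
    ('ExprPrime', 'EOF'): [],
    ('Term', '('): ['Factor', 'TermPrime'],
    ('Term', 'NUMBER'): ['Factor', 'TermPrime'],
    ('TermPrime', '*'): ['*', 'Factor', 'TermPrime'],
    ('TermPrime', '+'): [],
    ('TermPrime', ')'): [],
    ('TermPrime', 'EOF'): [],
    ('Factor', '('): ['(', 'Expr', ')'],
    ('Factor', 'NUMBER'): ['NUMBER'],
}
NONTERMINALS = {'Expr', 'ExprPrime', 'Term', 'TermPrime', 'Factor'}

def ll1_parse(tokens):
    """Return True if tokens (a list of token kinds ending in 'EOF') is valid."""
    stack = ['EOF', 'Expr']            # bottom marker plus the start symbol
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rule = table.get((top, look))
            if rule is None:
                return False           # no table entry: syntax error
            stack.extend(reversed(rule))  # push RHS, leftmost symbol on top
        else:
            if top != look:
                return False           # terminal mismatch
            pos += 1                   # matched: advance the input
    return pos == len(tokens)
```

Calling `ll1_parse(['NUMBER', '+', 'NUMBER', '*', 'NUMBER', 'EOF'])`, the token kinds of `3 + 4 * 5`, returns `True`.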

### LR(0) Parsing: Shift-Reduce Parsing Without Lookahead

#### Motivation

While LL(1) parsers focus on top-down parsing and require a clear decision-making process for each step, **LR(0) parsers** are a type of **bottom-up parser**. They work by gradually constructing a parse tree by starting from the input tokens and combining them into higher-level structures (non-terminals) until they reduce the entire input to the start symbol of the grammar.

An **LR(0) parser** is called a **shift-reduce parser** because it shifts input symbols onto a stack and then reduces groups of symbols according to the grammar’s production rules. One of the key advantages of LR parsers is their power: they can handle a wider range of grammars than LL parsers, including left-recursive ones. However, the simplest form, LR(0), uses no lookahead at all when deciding which action to take, which makes it easy to describe but sometimes too eager in its reductions.

#### What It Does

An LR(0) parser works by maintaining a stack of symbols and states. It alternates between two actions:

- **Shift**: Read the next token from the input and push it onto the stack.
- **Reduce**: Apply a production rule by replacing a group of symbols on top of the stack with the corresponding non-terminal, according to the grammar.

At each step, the parser consults a **parsing table** to decide whether to shift or reduce. This table is constructed from an **LR(0) automaton** based on the grammar.

The parser also has an **initial state** and tracks its current state through a set of transitions. The **goto** function is used to move between states based on the non-terminal symbols after a reduction.

#### What It Cannot Do

Because LR(0) parsers do not use any lookahead, they sometimes encounter situations where they cannot decide whether to shift or reduce, leading to **shift/reduce conflicts**. This limits the number of grammars that LR(0) parsers can handle. To deal with these conflicts, more advanced versions like **LR(1)** and **LALR(1)** introduce lookahead symbols to make better decisions.

Put differently, LR(0) parsers only work for **LR(0) grammars**: grammars in which no parser state permits both a shift and a reduction, or two different reductions, since without lookahead there is no way to choose between them.

#### Example: Simple Grammar for LR(0)

Consider the following grammar for arithmetic expressions with addition:

```
Expr -> Expr '+' Term
Expr -> Term
Term -> NUMBER
```

This grammar can handle simple expressions like `3 + 4 + 5`, where an `Expr` is made up of terms connected by `+`, and each `Term` is a number. Notice that the first rule is left-recursive; unlike LL parsers, bottom-up LR parsers handle left recursion without trouble.

#### LR(0) Automaton

To build an LR(0) parser, we first construct an **LR(0) automaton**. This automaton consists of **items**, which are production rules with a **dot (•)** indicating how much of the rule has been processed. For example, the item:

```
Expr -> Expr • '+' Term
```

means that the parser has already seen the `Expr` part of the rule and is now expecting a `'+'` followed by a `Term`.

Here are the **LR(0) items** for our grammar:

```
(1) Expr -> • Expr '+' Term
(2) Expr -> • Term
(3) Term -> • NUMBER
(4) Expr -> Expr • '+' Term
(5) Expr -> Term •
```

These items are organized into states, and transitions between the states occur based on the input tokens.

#### Parsing Table for LR(0)

The parsing table consists of two parts:

- **Action table**: Determines whether to shift or reduce, based on the current state and the next input symbol.
- **Goto table**: Determines which state to transition to after a reduction, based on the non-terminal that was reduced.

| State | `NUMBER` | `+` | `$` | Expr | Term |
|---|---|---|---|---|---|
| 0 | Shift 3 | | | Goto 1 | Goto 2 |
| 1 | | Shift 4 | Accept | | |
| 2 | | Reduce 2 | Reduce 2 | | |
| 3 | | Reduce 3 | Reduce 3 | | |
| 4 | Shift 3 | | | | Goto 5 |
| 5 | | Reduce 1 | Reduce 1 | | |

Here `$` marks the end of the input, and "Reduce *n*" means reduce by production *n*, numbering the grammar’s three productions in the order given above.

- **State 0**: If the next token is `NUMBER`, the parser shifts the token and moves to **State 3**.
- **State 1**: If the next token is `'+'`, the parser shifts the token and moves to **State 4**; on end of input, it accepts.
- **State 3**: The parser reduces using the rule `Term -> NUMBER`.
- **State 5**: The parser reduces using `Expr -> Expr '+' Term`.

#### Parsing Example

Let’s parse the expression `3 + 4` using the LR(0) parser:

1. **Stack**: Initially, the stack contains the start state `0`, and the lookahead is `3`.
2. **Shift**: The table entry for **State 0** on `NUMBER` says shift to **State 3**. The stack becomes `0 NUMBER 3`.
3. **Reduce**: **State 3** reduces `NUMBER` to `Term`; the goto entry for state 0 on `Term` gives state 2, so the stack becomes `0 Term 2`.
4. **Reduce**: **State 2** reduces `Term` to `Expr`; the goto entry for state 0 on `Expr` gives state 1, so the stack becomes `0 Expr 1`.
5. **Shift**: The next input is `+`, and in **State 1** the table says shift to **State 4**. The stack becomes `0 Expr 1 + 4`.
6. **Shift**: The next input is `4` (a `NUMBER`), so the parser shifts to **State 3**. The stack becomes `0 Expr 1 + 4 NUMBER 3`.
7. **Reduce**: The parser reduces `NUMBER` to `Term`; the goto entry for state 4 on `Term` gives state 5, so the stack becomes `0 Expr 1 + 4 Term 5`.
8. **Reduce**: **State 5** reduces by `Expr -> Expr '+' Term`, leaving `0 Expr 1`. With the input exhausted, **State 1** accepts: the input is successfully parsed as an `Expr`.

#### Code Outline (Pseudocode)

Here’s a simplified pseudocode structure for an LR(0) parser:

```
def parse():
    stack = [0]  # start with the initial state
    lookahead = get_next_token()
    while True:
        state = stack[-1]
        action = action_table[state][lookahead]
        if action == 'shift':
            stack.append(lookahead)
            stack.append(shift_target_state(state, lookahead))
            lookahead = get_next_token()  # the token was consumed
        elif action == 'reduce':
            production = production_for(action)
            # Pop one symbol and one state per right-hand-side symbol
            stack = stack[:-2 * len(production.rhs)]
            non_terminal = production.lhs
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"
```

The **action table** determines whether to shift or reduce, and the **goto table** determines how to transition after reductions.
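Filling in the action and goto tables from this section gives a small runnable Python version. The string encoding of actions (`'s3'` for "shift to state 3", `'r2'` for "reduce by production 2", `'acc'` for accept) is our own choice:

```python
productions = {          # rule number -> (left-hand side, length of right-hand side)
    1: ('Expr', 3),      # Expr -> Expr '+' Term
    2: ('Expr', 1),      # Expr -> Term
    3: ('Term', 1),      # Term -> NUMBER
}
action = {
    (0, 'NUMBER'): 's3',
    (1, '+'): 's4', (1, '$'): 'acc',
    (2, '+'): 'r2', (2, '$'): 'r2',
    (3, '+'): 'r3', (3, '$'): 'r3',
    (4, 'NUMBER'): 's3',
    (5, '+'): 'r1', (5, '$'): 'r1',
}
goto = {(0, 'Expr'): 1, (0, 'Term'): 2, (4, 'Term'): 5}

def lr0_parse(tokens):
    """tokens: list of token kinds ending in '$'. Returns True on acceptance."""
    stack = [0]                        # interleaved states and symbols; state on top
    pos = 0
    while True:
        state, look = stack[-1], tokens[pos]
        act = action.get((state, look))
        if act is None:
            return False               # no table entry: syntax error
        if act == 'acc':
            return True
        if act[0] == 's':              # shift: push the token and the new state
            stack += [look, int(act[1:])]
            pos += 1
        else:                          # reduce: pop the RHS, then goto on the LHS
            lhs, length = productions[int(act[1:])]
            del stack[-2 * length:]
            stack += [lhs, goto[(stack[-1], lhs)]]
```

`lr0_parse(['NUMBER', '+', 'NUMBER', '$'])`, the token kinds of `3 + 4`, returns `True`, mirroring the step-by-step trace above.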

### Canonical LR(1) Parsing: Powerful Parsing with Lookahead

#### Motivation

While LR(0) parsers are effective for simple grammars, they encounter conflicts in more complex situations, especially when the parser has to decide between shifting and reducing without enough information. **Canonical LR(1)** parsers address this by introducing a **lookahead** of **one token** to guide decisions. The "1" in LR(1) signifies that the parser uses one token of lookahead to choose between conflicting parsing actions, allowing it to handle a much broader range of grammars and to resolve conflicts that LR(0) parsers cannot.

Canonical LR(1) parsers are extremely powerful: they can handle every **deterministic context-free grammar** (though, like any deterministic parser, they cannot handle genuinely ambiguous grammars). This makes them an excellent choice for building compilers that require robust parsing capabilities. However, they come at the cost of more complex parsing tables and larger automata, a problem addressed by LALR parsers (a more efficient variant).

#### What It Does

An LR(1) parser uses **items** (just like in LR(0)), but with an added **lookahead symbol** to help make decisions during parsing. This lookahead is included in the construction of the **LR(1) automaton**, and it enables the parser to differentiate between similar grammar rules by checking what comes next in the input. The parser still operates as a **shift-reduce** parser, but now it consults both the current state and the lookahead symbol to decide whether to shift, reduce, or accept.

The key difference between LR(1) and LR(0) is the extra power granted by the lookahead, which eliminates many of the conflicts that arise in LR(0) parsing.

#### What It Cannot Do

Though Canonical LR(1) parsers are extremely powerful, they are often seen as impractical for large grammars because of the large size of the parsing tables. The sheer number of states and transitions in the automaton can be overwhelming. This is why in practical applications, **LALR(1)** parsers are often used instead, as they combine the power of LR(1) with the efficiency of smaller tables.

#### Example: Simple Grammar for LR(1)

Consider the following grammar for arithmetic expressions with addition and parentheses:

```
Expr -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term -> '(' Expr ')' | NUMBER
```

This grammar can handle simple expressions with numbers, addition, and parentheses.

#### LR(1) Items

To build an LR(1) parser, we start with **LR(1) items**, which look like LR(0) items but include a **lookahead**. For example, an LR(1) item might look like this:

```
Expr -> Term • ExprPrime, {+, $}
```

This item indicates that we are in the middle of parsing the rule `Expr -> Term ExprPrime`, and our lookahead symbol is either `'+'` or the end of input (`$`). The lookahead helps the parser decide whether to continue parsing `ExprPrime` or reduce.

Here are some of the **LR(1) items** for the grammar:

```
(1) Expr -> • Term ExprPrime, {$}
(2) ExprPrime -> • '+' Term ExprPrime, {$}
(3) Term -> • '(' Expr ')', {+, $}
(4) Term -> • NUMBER, {+, $}
(5) Expr -> Term • ExprPrime, {$}
(6) ExprPrime -> '+' • Term ExprPrime, {$}
```

#### LR(1) Automaton

The **LR(1) automaton** consists of states constructed from the LR(1) items. Each state represents a set of items, and transitions between states occur based on the next symbol in the input and the lookahead symbol.

#### Parsing Table for LR(1)

The parsing table for an LR(1) parser is similar to that of LR(0), but it uses the lookahead symbol to make decisions. Here's an example parsing table for the above grammar:

| State | `(` | `NUMBER` | `+` | `)` | `$` | Expr | ExprPrime | Term |
|---|---|---|---|---|---|---|---|---|
| 0 | Shift 3 | Shift 4 | | | | Goto 1 | Goto 2 | Goto 5 |
| 1 | | | | | Accept | | | |
| 2 | | | Shift 6 | Reduce 2 | Reduce 2 | | | |
| 3 | Shift 3 | Shift 4 | | | | Goto 8 | Goto 9 | Goto 5 |
| 4 | | | Reduce 4 | Reduce 4 | Reduce 4 | | | |
| 5 | | | Shift 6 | | | | | |
| 6 | Shift 3 | Shift 4 | | | | | Goto 2 | Goto 5 |

Rows for the states reached while parsing a parenthesized sub-expression (such as states 8 and 9) are omitted here for brevity.

- **State 0**: If the next token is `(` or `NUMBER`, the parser shifts to **State 3** or **State 4**, respectively.
- **State 1**: The input is accepted if `$` is the lookahead.
- **State 2**: Depending on the lookahead, the parser may reduce using `ExprPrime -> ε` (if the next token is `)` or `$`) or shift `+`.
- **State 6**: Reached after shifting `'+'`; the parser continues processing the rest of the `ExprPrime`.

#### Parsing Example

Let’s parse the expression `(3 + 4)` using the LR(1) parser:

1. **Stack**: Initially, the stack contains the start state `0`, and the lookahead is `'('`.
2. **Shift**: The table entry for **State 0** shifts `(` and moves to **State 3**. The stack becomes `0 ( 3`.
3. **Shift**: In **State 3**, the next token `3` (a `NUMBER`) is shifted, moving to **State 4**.
4. **Reduce**: In **State 4**, the parser reduces `NUMBER` to `Term`, and the stack becomes `0 ( 3 Term 5`.
5. **Shift**: In **State 5**, the parser shifts `+` and moves to **State 6**.
6. **Shift**: In **State 6**, the next token `4` is shifted to **State 4**.
7. **Reduce**: The parser reduces `NUMBER` to `Term` again.
8. **Reduce**: The parser reduces by `ExprPrime -> '+' Term ExprPrime` (the trailing `ExprPrime` coming from an ε reduction) and then by `Expr -> Term ExprPrime`, leaving an `Expr` between the parentheses.
9. **Shift/Reduce**: Finally, the parser shifts `)`, reduces `'(' Expr ')'` to `Term`, and accepts the input.

#### Code Outline (Pseudocode)

Here’s a simplified pseudocode structure for an LR(1) parser:

```
def parse():
    stack = [0]  # start with the initial state
    lookahead = get_next_token()
    while True:
        state = stack[-1]
        action = action_table[state][lookahead]
        if action == 'shift':
            stack.append(lookahead)
            stack.append(shift_target_state(state, lookahead))
            lookahead = get_next_token()
        elif action == 'reduce':
            # Pop symbols and states according to the production's length
            production = get_production(action)
            stack = stack[:-2 * len(production.rhs)]
            non_terminal = production.lhs
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"
```

The **action table** determines whether to shift or reduce, while the **goto table** directs the parser to the correct state after a reduction.

### LALR(1) Parsing: Practical and Efficient LR Parsing

#### Motivation

While **Canonical LR(1)** parsers are extremely powerful and capable of handling any context-free grammar, they suffer from a practical limitation: their parsing tables tend to be quite large. As the number of states in the automaton increases with the complexity of the grammar, so does the memory requirement for the parsing table. **LALR(1)** (Look-Ahead LR) parsers provide a clever compromise between the power of LR(1) parsers and the efficiency needed for real-world usage.

LALR(1) parsers work by merging similar states in the LR(1) automaton, significantly reducing the size of the parsing table, while still using one token of lookahead. This makes LALR(1) parsers practical for use in compilers (like those generated by tools such as Yacc or Bison), which is why many programming languages rely on them.

#### What It Does

An LALR(1) parser is constructed by **merging certain states** in the LR(1) automaton that have the same core but differ only in their lookahead sets. This reduces the number of states without losing the ability to handle the same class of grammars as LR(1). The merging process ensures that the resulting parser is still able to handle the correct lookahead in any situation, but with a smaller table size.

In short, LALR(1) parsers have the same parsing power as LR(1) but require far fewer states and transitions. This reduction in table size makes them a practical choice for compiler construction.

#### What It Cannot Do

LALR(1) parsers accept a slightly smaller class of grammars than canonical LR(1) parsers. Because states are merged, an LALR(1) parser can encounter **reduce/reduce conflicts** that would not exist in the canonical LR(1) parser (merging never introduces new shift/reduce conflicts). In practice, most grammars used in programming languages can be rewritten or adapted to avoid these conflicts, making LALR(1) parsers highly useful.

#### Example: Arithmetic Grammar for LALR(1)

We’ll use the same arithmetic grammar we used for the LR(1) example:

```
Expr -> Term ExprPrime
ExprPrime -> '+' Term ExprPrime | ε
Term -> '(' Expr ')' | NUMBER
```

This grammar supports simple arithmetic expressions with numbers, addition, and parentheses.

#### Merging LR(1) States

To construct an LALR(1) parser, we first generate the LR(1) automaton, then look for states that have the same **core items** but differ in their **lookahead sets**. For example, consider two LR(1) items that share the same core but differ in lookahead:

```
Expr -> Term • ExprPrime, {+, $}
Expr -> Term • ExprPrime, {)}
```

In the Canonical LR(1) automaton, these would be two separate states. However, since their cores (the items without the lookaheads) are identical, an LALR(1) parser merges them into one state, combining the lookahead sets into `{+, $, )}`.

By merging such states, the number of states and the overall size of the parsing table is reduced.
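The merging step itself is mechanical enough to sketch in a few lines of Python. Here each item is encoded as a `(production, dot position, lookahead)` tuple; this representation is our own illustrative choice:

```python
from collections import defaultdict

# Two LR(1) states with the same core but different lookaheads,
# matching the example above.
state_a = {(('Expr', ('Term', 'ExprPrime')), 1, '+'),
           (('Expr', ('Term', 'ExprPrime')), 1, '$')}
state_b = {(('Expr', ('Term', 'ExprPrime')), 1, ')')}

def core(state):
    # The core of a state drops the lookahead component of every item.
    return frozenset((prod, dot) for prod, dot, _ in state)

def merge_states(states):
    """Group states by core and union their items (and thus lookaheads)."""
    merged = defaultdict(set)
    for state in states:
        merged[core(state)] |= state
    return list(merged.values())

merged = merge_states([state_a, state_b])
```

After merging, a single state remains whose combined lookahead set is `{+, $, )}`.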

#### Parsing Table for LALR(1)

Here is an example LALR(1) parsing table for our grammar:

| State | `(` | `NUMBER` | `+` | `)` | `$` | Expr | ExprPrime | Term |
|---|---|---|---|---|---|---|---|---|
| 0 | Shift 3 | Shift 4 | | | | Goto 1 | Goto 2 | Goto 5 |
| 1 | | | | | Accept | | | |
| 2 | | | Shift 6 | Reduce 2 | Reduce 2 | | | |
| 3 | Shift 3 | Shift 4 | | | | Goto 8 | Goto 9 | Goto 5 |
| 4 | | | Reduce 4 | Reduce 4 | Reduce 4 | | | |
| 5 | | | Shift 6 | | | | | |
| 6 | Shift 3 | Shift 4 | | | | | Goto 2 | Goto 5 |

Rows for the states reached while parsing a parenthesized sub-expression (such as states 8 and 9) are omitted here for brevity.

- **State 0**: If the next token is `(` or `NUMBER`, the parser shifts to **State 3** or **State 4**.
- **State 1**: If the lookahead is `$`, the input is accepted.
- **State 2**: Depending on the lookahead, the parser either reduces `ExprPrime -> ε` or shifts `+` to **State 6**.
- **State 6**: The parser has shifted past `+`, allowing for another addition operation.

#### Parsing Example

Let’s parse the expression `3 + 4` using the LALR(1) parser:

1. **Stack**: Initially, the stack contains the start state `0`, and the lookahead is `3`.
2. **Shift**: The parser shifts `3` (a `NUMBER`) and moves to **State 4**.
3. **Reduce**: In **State 4**, the parser reduces `NUMBER` to `Term`; the goto on `Term` leads to **State 5**.
4. **Shift**: The parser shifts `+` and moves to **State 6**.
5. **Shift**: The next token `4` is shifted, moving to **State 4**.
6. **Reduce**: The parser reduces `NUMBER` to `Term` again.
7. **Reduce**: The parser reduces by `ExprPrime -> '+' Term ExprPrime` (with an ε `ExprPrime` at the end), combining the terms.
8. **Reduce**: Finally, the parser reduces to `Expr` via `Expr -> Term ExprPrime` and accepts, completing the parse.

#### Code Outline (Pseudocode)

Here’s a simplified pseudocode for an LALR(1) parser:

```
def parse():
    stack = [0]  # start with the initial state
    lookahead = get_next_token()
    while True:
        state = stack[-1]
        action = action_table[state][lookahead]
        if action == 'shift':
            stack.append(lookahead)
            stack.append(shift_target_state(state, lookahead))
            lookahead = get_next_token()
        elif action == 'reduce':
            production = get_production(action)
            stack = stack[:-2 * len(production.rhs)]
            non_terminal = production.lhs
            goto_state = goto_table[stack[-1]][non_terminal]
            stack.append(non_terminal)
            stack.append(goto_state)
        elif action == 'accept':
            return "Input accepted"
        else:
            return "Error"
```

In this pseudocode, the parser uses the **action** and **goto** tables to determine whether to shift, reduce, or accept based on the current state and lookahead token.

### Exercises

#### Recursive Descent Parsing Exercise

**Objective**: Understand the working of recursive descent parsers by manually writing a parser for a simple grammar.

**Task**:

- Given the following grammar for simple Boolean expressions, write a **recursive descent parser** in pseudocode or a programming language of your choice.

```
Expr -> Term ExprPrime
ExprPrime -> 'OR' Term ExprPrime | ε
Term -> Factor TermPrime
TermPrime -> 'AND' Factor TermPrime | ε
Factor -> 'NOT' Factor | '(' Expr ')' | ID
```

Parse the following expression using your recursive descent parser:

`(ID AND NOT ID) OR ID`

**Questions**:

1. What are the recursive functions for each non-terminal?
2. How does the parser handle precedence between `AND` and `OR`?
3. What happens if you introduce left recursion in this grammar? Rewrite a left-recursive rule in non-left-recursive form.

#### LL(1) Parsing Table Construction

**Objective**: Develop skills to construct LL(1) parsing tables and apply them in parsing.

**Task**:

- Given the grammar below, construct an **LL(1) parsing table**:

```
S -> A B
A -> 'a' A | ε
B -> 'b' B | ε
```

- Create the FIRST and FOLLOW sets for each non-terminal.
- Construct the LL(1) parsing table.
- Manually simulate parsing the string `"aabb"` using the table.

**Questions**:

1. What challenges did you face while constructing the parsing table?
2. Can this grammar be parsed by an LL(1) parser without modification?
3. Why is this grammar suitable for LL(1) parsing?

#### LR(0) Shift-Reduce Parsing

**Objective**: Gain experience with shift-reduce parsing and the LR(0) parsing mechanism.

**Task**:

- For the following grammar, construct the **LR(0) items** and the **LR(0) parsing table**:

```
S -> L '=' R
S -> R
L -> '*' R
L -> 'ID'
R -> L
```

Parse the input string `ID = *ID` using the LR(0) parsing table and show the contents of the stack at each step.

**Questions**:

1. At what points does the parser perform a shift action, and when does it reduce?
2. What is the main limitation of the LR(0) parser for this grammar? Are there any conflicts? How would LR(1) resolve them?

#### Canonical LR(1) Parsing Table Construction

**Objective**: Understand and apply Canonical LR(1) parsing techniques.

**Task**:

- Given the following grammar, construct the **LR(1) items** and the **Canonical LR(1) parsing table**:

```
S -> A 'b'
A -> 'c' A | 'd'
```

Use your table to parse the input strings `"d b"` and `"c d b"`.

**Questions**:

1. How does the lookahead help resolve conflicts that would arise in LR(0)?
2. What is the size of the LR(1) parsing table compared to the LR(0) table? Can you explain the difference?
3. What kinds of conflicts does the lookahead resolve that LR(0) could not?

#### LALR(1) Parser State Merging

**Objective**: Gain experience with merging states in LALR(1) parsers and understanding its effect.

**Task**:

- Construct the **LR(1) items** for the grammar below, then merge the states to create the **LALR(1) automaton** and parsing table:

```
S -> 'a' A 'c' | 'b' A 'd'
A -> 'e'
```

Compare the size of the Canonical LR(1) and LALR(1) tables for this grammar.

**Questions**:

1. Which states were merged, and why?
2. Does the LALR(1) parser still function correctly for this grammar, or does merging states introduce any new conflicts?
3. Discuss the practical advantages of LALR(1) over Canonical LR(1) in real-world compiler design.

#### General Challenge: Grammar Classification

**Task**: For each of the following grammars, classify whether it is suitable for parsing using Recursive Descent, LL(1), LR(0), Canonical LR(1), or LALR(1) parsers, and justify your choice:

1. `S -> 'a' S 'b' | ε`
2. `S -> A 'c'` with `A -> 'a' A | 'b'`
3. `S -> A A` with `A -> 'a' | ε`

**Questions**:

1. For each grammar, indicate whether it can be parsed by a Recursive Descent parser. If not, how would you modify the grammar to make it parsable by Recursive Descent?
2. Identify if each grammar is LL(1). If not, is it because of ambiguity or left recursion?
3. For each grammar, determine if it can be parsed by an LR(0) parser or if it requires lookahead.

### Introduction to Parser Generators

Parser generators are tools that automatically generate parsers from a given grammar. Instead of manually implementing parsing functions for each grammar rule, you provide a description of the grammar to the tool, and it generates the necessary code to parse inputs according to that grammar. This approach saves time, reduces errors, and is especially helpful for large or complex grammars.

#### Popular Parser Generators

There are several widely-used parser generators available:

- **ANTLR (Another Tool for Language Recognition)**: One of the most popular parser generators, supporting a wide range of languages. ANTLR generates parsers in multiple programming languages, such as Java, Python, C#, and JavaScript.
- **Bison**: The GNU version of Yacc (Yet Another Compiler Compiler), a classic parser generator used primarily with C or C++.
- **JavaCC (Java Compiler Compiler)**: Specifically designed for Java, this tool is used to generate parsers in the Java language.
- **Yacc**: The original Unix parser generator, often used with Lex (a lexical analyzer generator). Yacc is similar to Bison but works with C.

In this lesson, we’ll focus on **ANTLR** because of its modern design, rich features, support for multiple languages, and ease of use.

### Writing a Simplified C Grammar and Parser Using ANTLR

#### Step 1: Install ANTLR

Before starting, you need to install ANTLR. Follow these steps:

1. Download the latest ANTLR jar from ANTLR’s official website.

2. Set up a project in your preferred language (Java, Python, C#, etc.) that can integrate ANTLR.

3. Add ANTLR’s jar file to your project dependencies and install the ANTLR runtime library for your chosen language.

For example, in a **Java project**, you can use Maven with this dependency:

```
<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
    <version>4.9.2</version>
</dependency>
```

Once you have the setup ready, you can start writing the grammar.

#### Step 2: Create a Simplified C Grammar

Here’s a simplified grammar that can parse basic C-like constructs such as variable declarations, if-else statements, and function calls:

```
grammar SimpleC;

// Parser Rules
prog: stat+ ;
stat: decl ';'
    | assign ';'
    | if_stat
    | func_call ';'
    | block
    ;
block: '{' stat* '}' ;   // braced statement blocks, used by the sample input later
decl: 'int' ID ('=' expr)? ;
assign: ID '=' expr ;
if_stat: 'if' '(' expr ')' stat ('else' stat)? ;
func_call: ID '(' (expr (',' expr)*)? ')' ;
// Alternatives are tried in order, so '*' and '/' bind tighter than '+' and '-'
expr: expr ('*' | '/') expr
    | expr ('+' | '-') expr
    | expr ('>' | '<') expr   // comparisons, e.g. x > 0
    | '(' expr ')'
    | NUMBER
    | ID
    ;

// Lexer Rules
ID: [a-zA-Z_][a-zA-Z_0-9]* ;
NUMBER: [0-9]+ ;
WS: [ \t\n\r]+ -> skip ;
```

**Explanation:**

- **prog**: The root rule, which expects a sequence of **statements** (`stat+`).
- **stat**: Can be a variable declaration (`decl`), an assignment (`assign`), an `if` statement (`if_stat`), or a function call (`func_call`).
- **decl**: Handles variable declarations like `int x = 10;`.
- **assign**: Represents an assignment like `x = 5;`.
- **if_stat**: Represents `if` statements, optionally followed by an `else` block.
- **func_call**: Handles simple function calls, e.g., `foo(3, 4);`.
- **expr**: Represents expressions, supporting arithmetic operations and parentheses.
- **ID** and **NUMBER** are **lexer rules** that identify variable names and numeric values.
- **WS** is the whitespace rule, which skips spaces, tabs, and newlines.
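To appreciate what ANTLR automates, here is a hand-written recursive-descent sketch of the `expr` rule, restricted to numbers for brevity. It uses the classic trick of one function per precedence level, with iteration instead of left recursion: ANTLR 4 accepts the left-recursive `expr` rule directly, but a hand-rolled parser cannot. All names here are illustrative, not ANTLR's generated code:

```java
// Hand-written recursive-descent parser/evaluator for arithmetic expressions.
// Precedence comes from rule layering: addExpr calls mulExpr, so '*' and '/'
// bind tighter than '+' and '-'. Left associativity comes from the loops.
public class ExprParser {
    private final String src;
    private int pos;

    public ExprParser(String src) { this.src = src; this.pos = 0; }

    public static int eval(String s) {
        // Strip whitespace up front instead of writing a real lexer.
        ExprParser p = new ExprParser(s.replaceAll("\\s+", ""));
        return p.addExpr();
    }

    // addExpr: mulExpr (('+' | '-') mulExpr)* ;
    private int addExpr() {
        int value = mulExpr();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            int rhs = mulExpr();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // mulExpr: primary (('*' | '/') primary)* ;
    private int mulExpr() {
        int value = primary();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            int rhs = primary();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    // primary: '(' addExpr ')' | NUMBER ;
    private int primary() {
        if (src.charAt(pos) == '(') {
            pos++;                      // consume '('
            int value = addExpr();
            pos++;                      // consume ')'
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("3 + 4 * 5"));   // 23
        System.out.println(eval("(3 + 4) * 5")); // 35
    }
}
```

With a generator, this entire layering is derived for you from the grammar's alternative order, which is one reason tools like ANTLR scale so much better to full languages.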

#### Step 3: Generate the Parser

Now that we have the grammar, we need to generate the parser using ANTLR.

**Run ANTLR** to generate the parser files from the grammar. If you're working in a command-line environment, you can do this with:

```
java -jar antlr-4.x-complete.jar SimpleC.g4
```

This will generate several files, including:

- A **Lexer** to break the input into tokens.
- A **Parser** to apply the rules.
- Supporting classes and data structures (like the parse tree).

In an IDE or a build system like Maven, you can include the generated files and the ANTLR runtime in your project.

#### Step 4: Write Code to Use the Parser

Now that the parser and lexer have been generated, let's write code to use them to parse a simple C-like program. Here’s an example in **Java**.

```
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class SimpleCParserDemo {
    public static void main(String[] args) throws Exception {
        // Sample input C-like program
        String program = "int x = 5; x = x + 2; if (x > 0) { foo(x); }";

        // Step 1: Create a CharStream to read the input
        CharStream input = CharStreams.fromString(program);

        // Step 2: Create a lexer that feeds off of the input CharStream
        SimpleCLexer lexer = new SimpleCLexer(input);

        // Step 3: Create a buffer of tokens
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // Step 4: Create a parser that feeds off the tokens buffer
        SimpleCParser parser = new SimpleCParser(tokens);

        // Step 5: Begin parsing at the 'prog' rule
        ParseTree tree = parser.prog();

        // Print the parse tree (LISP-style) for debugging
        System.out.println(tree.toStringTree(parser));
    }
}
```

**Explanation:**

- **CharStreams.fromString(program)**: Reads the C-like input program as a character stream.
- **SimpleCLexer**: The lexer generated by ANTLR from our grammar, which breaks the input into tokens.
- **CommonTokenStream**: Buffers the tokens produced by the lexer.
- **SimpleCParser**: The parser generated by ANTLR from our grammar, which processes the token stream based on the grammar.
- **ParseTree tree = parser.prog()**: Starts parsing the input from the `prog` rule and generates a parse tree.
- **toStringTree(parser)**: Prints the parse tree in a LISP-like format for easy debugging.

#### Step 5: Running the Parser

When you run the Java code with the sample input `"int x = 5; x = x + 2; if (x > 0) { foo(x); }"`, the output will be a parse tree showing how the input is structured according to the grammar.

Sample output (simplified for readability):

```
(prog (decl int x = (expr 5)) ;
(assign x = (expr (expr x) + (expr 2))) ;
(if_stat if ( (expr (expr x) > (expr 0)) ) (stat (func_call foo ( (expr x) )))))
```

This output shows that the program has been successfully parsed:

- The first statement declares the variable `x`.
- The second statement assigns `x + 2` to `x`.
- The third statement is an `if` statement, calling the function `foo(x)`.

### Why This Matters

- **Lexer and Parser Separation**: ANTLR separates the lexer and parser, making it easier to handle different levels of grammar processing. The lexer turns input into tokens, while the parser handles the structure and meaning of those tokens.
- **Tree Structure**: The parse tree produced by ANTLR gives you an explicit representation of the input structure, which you can traverse or manipulate for tasks like semantic analysis or code generation.
- **Error Handling**: ANTLR automatically includes robust error handling and reporting in the generated parser. If the input doesn't conform to the grammar, ANTLR can provide informative error messages.
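To make the traversal idea concrete, here is a toy sketch of a semantic-analysis pass that flags variables used before they are declared. In a real project you would subclass ANTLR's generated listener (e.g. `SimpleCBaseListener`) and walk the actual parse tree with a tree walker; the homemade `Node` type below is an assumption used only to keep the example self-contained:

```java
import java.util.*;

// Toy illustration of walking a parse tree for semantic analysis.
// The Node class stands in for ANTLR's parse-tree nodes.
public class UseBeforeDeclCheck {
    // Minimal tree node: a label plus children (leaves carry identifier text).
    static class Node {
        final String label;        // e.g. "decl", "assign", or a leaf token
        final List<Node> children;
        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Walk the tree in source order, recording declarations and collecting
    // identifiers that are used before any declaration introduces them.
    public static List<String> undeclaredUses(Node root) {
        Set<String> declared = new HashSet<>();
        List<String> errors = new ArrayList<>();
        walk(root, declared, errors);
        return errors;
    }

    private static void walk(Node n, Set<String> declared, List<String> errors) {
        if (n.label.equals("decl")) {
            // First child of a declaration is the variable name being introduced.
            declared.add(n.children.get(0).label);
            // Remaining children (the initializer) may themselves use variables.
            for (int i = 1; i < n.children.size(); i++) {
                walk(n.children.get(i), declared, errors);
            }
            return;
        }
        if (n.children.isEmpty()) {
            // A leaf that looks like an identifier must already be declared.
            if (n.label.matches("[a-zA-Z_]\\w*") && !declared.contains(n.label)) {
                errors.add(n.label);
            }
            return;
        }
        for (Node c : n.children) walk(c, declared, errors);
    }

    public static void main(String[] args) {
        // Models: int x = 5; y = x + 1;   ('y' is never declared)
        Node prog = new Node("prog",
            new Node("decl", new Node("x"), new Node("5")),
            new Node("assign", new Node("y"), new Node("x"), new Node("1")));
        System.out.println(undeclaredUses(prog)); // [y]
    }
}
```

The same declared-set bookkeeping transfers directly to an ANTLR listener: record names in the method that fires on `decl` contexts, and check names in the method that fires on identifier references.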

### Exercises

1. **Modify the Grammar**:
   - Add support for `while` loops and variable types like `float` and `char`.
   - Add `return` statements for functions.

2. **Generate Parse Tree**:
   - Extend the provided Java code to visualize the parse tree using an external library (e.g., `org.antlr.v4.gui.TreeViewer`).

3. **Error Recovery**:
   - Introduce syntax errors in the input (e.g., missing semicolons or parentheses) and observe how ANTLR handles them. Modify the grammar to improve error reporting.

4. **Semantic Analysis**:
   - After building the parse tree, implement a pass that checks for undeclared variables or incorrect type usage.