Lecture 5

Syntax Error Recovery
Parsing: The Big Picture

1.0 - Syntax Error Recovery

Programmers make errors - how do we recover from these errors?
There is no such thing as perfect error recovery
We focus on error recovery that handles simple common errors.

1.1 - Recovery Strategy for Recursive-Descent Parsing

Perform local error recovery on matching a single token
Synchronise the input stream at the start and end of each parse method (parse method → trying to parse a non-terminal symbol)

1.1.1 - Recovery Strategy for Matching a Single Token (Terminal Symbol)

What if the current token isn’t KW_DO?

void parseWhileStatement() {
  tokens.match(Token.KW_WHILE);
  parseCondition();
  tokens.match(Token.KW_DO); // An error is encountered at this step.
  parseStatement();
}

It might be missing:

$\text{while i < 5\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ begin x := x + i * i; i := i + 1 end}$
- In this case, we notice that the next token is $\text{begin}$ is something that starts a statement (and a statement is the next thing to be parsed)
- We can assume that the keyword do is missing, and continue parsing as normal.
A token may be erroneously inserted.

$\text{while i < 5 {\color{ff7f7f}end} do begin x := x * i; i := i + 1 end}$
A single erroneous token replacing the expected token

$\text{while i < 5 {\color{ff7f7f}then} begin x := x * i; i := i + 1 end}$
When the next token is not what we expected it to be, we look at the next token along:
- If after skipping a token, the next token is what we expected it to be, we know that a token was erroneously inserted
- If after skipping a token, the next token isn’t what we expected it to be, we assume that it was a replacement for the token that we were looking for and try to parse the following tokens.
We can replace the original tokens.match method with one that has two arguments:
- tokens.match(token_to_match, proceeding_set)
- token_to_match the token that we’re trying to match to
- proceeding set a set of all the tokens that could start the following section / statement
```
void parseWhileStatement() {
  tokens.match(Token.KW_WHILE);
  parseCondition();
  tokens.match(Token.KW_DO, STATEMENT_START_SET); 
  parseStatement();
}
```
- If the current token is KW_DO, match it;
- Else if the current token is in STATEMENT_START_SET, assume KW_DO was missing and take no further action
- Else skip the current token and:
  - if the current token is KW_DO, assume that the previous token was erroneously inserted and match the new token;
  - Else assume that the expected token was replaced by the previous token; take no further action

1.1.2 - Recovery Strategy for Parse Method (Non-Terminal Symbol)

We perform synchronisation at the start and end of the parse methods

void parseN(TokenSet recoverSet) {
  // Skip tokens until the current token is in the expected set or recoverSet
  // if the new current token is not in expected
      // Return as though N had been recognised
      // Returning an error node / value if required
  // else, recognise the token N
      recog(N)
  // Skip tokens until the current token is in recoverSet
}

expected is in the set of terminal symbols that are valid at the start of parsing N
recoverSet contains the terminal symbols that can follow N in the context in which N is being parsed

Essentially removing the garbage at the start and end.

1.1.3 - Implementation using Method Parse

We don’t want to hard-code the behaviour that we proposed before for every non-terminal symbol. Let’s create a pattern that does this more succinctly.

void parseN(TokenSet recoverSet) {
  parse("N", N_START_SET, recoverSet, () -> {
    recog(N)
  });
}

The parameters of the method parse are:
- The name of the rule for use in error messages
- The set of tokens expected at start of rule
- Set of tokens to recover at on a syntax error
- A function that parses the non-terminal.
The parse method performs the synchronisation at the start and end of the parseN method

So from this, we can modify our parseWhileStatement method from:

void parseWhileStatement() {
  tokens.match(Token.KW_WHILE);
  parseCondition();
  tokens.match(Token.KW_DO, STATEMENT_START_SET); 
  parseStatement();
}

To the following:

void parseWhileStatement(TokenSet recoverSet) {
// Name for debugging, rule start set, recover set, non-terminal parsing function
parse("While Statement", Token.KW_WHILE, recoverSet, 
  () -> {
    // Assume that synchronisation has occurred, and 
    // The current token is KW_WHILE

    // Since synchronisation has occurred, the current token is KW_WHILE.
    // Cannot fail as a result of this, so don't need to use the recoverable 
    // method version of tokens.match.
    tokens.match(Token.KW_WHILE); 
    parseCondition(recoverSet.union(Token.KW_DO)); // [1]
    tokens.match(Token.KW_DO, STATEMENT_START_SET); // [2]
    parseStatement(recoverSet); // [3]
  });  
}

[1] Note that our recoverSet is the union of the existing recoverSet (that is, the set of tokens on which the WhileStatement can recover on and KW_DO (as we are up to parsing the condition attached to the while).
- Also note that recoverSet.union(...) doesn’t modify the original recoverSet variable, it returns a new instance
[2] The arguments to token.match(...) with recovery are:
1. The token to try to match
2. The set of tokens that can follow
[3] Since no token can follow (for the purpose of syntax recovery), we just use the non-recover version of token.match(...)

1.1.4 - Recovery Strategy Example for `RelCondition`

Suppose we wanted to parse a RelCondition, which is defined by the following production

\text{RelCondition}\rightarrow\text{Exp[RelOp Exp]}

void parseRelStatement(TokenSet recoverSet) {
  parse("RelCondition", REL_COND_START_SET, recoverSet, 
    // Anonymous parse method for parsing a RelCondition
    () -> {
        // The current token is in REL_COND_START_SET
        parseExp(recoverSet.union(REL_OPS_SET));  // [1]
        // Parse the optional [RelOp Exp] portion of the production [2]
        if (tokens.isIn(REL_OPS_SET)) {
            parseRelOp(recoverSet.union(EXP_START_SET); 
            parseExp(recoverSet);
        }
    });
}

[1] The recoverSet for parsing the expression is the union of:
- Anything that RelCondition could recover on (this is in recoverSet already)
- Anything that an Expression could recover on
[2] When parsing the RelOp set, the recoverSet would be comprised of
- The recoverSet for the RelCondition
- Anything that could start an expression

1.2 - Parsing: The Big Picture

During the syntax analysis phase, the compiler needs to
- Recognise the grammatical structure of a sequence of lexical tokens;
- While performing error recovery where possible, and; recover as best as we can so we can report as much of the error to the user as possible
- Build an Abstract Syntax Tree (AST) and symbol table

1.2.1 - Parsing and Building an AST

Consider parsing the following production with error recovery.

$\text{Factor}\rightarrow\text{LPAREN Condition RPAREN | NUMBER | LValue}$

Key - Building AST | Error Recovery | Parsing Code

ExpNode parseFactor(TokenSet recoverSet) {
    // Use exp.parse at it returns an expression node
    // Identical function to 
    return exp.parse("Factor", FACTOR_START_SET, recoverSet, 
        () -> {
            // Assume synchronisation has been performed, 
            // the current token is in FACTOR_START_SET
            ExpNode result = null;

            // There are three symbols that could start a Factor production:
            // LPAREN, NUMBER or LVALUE. Parse each of them.
            if (token.isMatch(Token.LPAREN)) {
                // Cannot fail so don't use recovery version
                tokens.match(Token.LPAREN); 
                result = parseCondition(recoverSet.union(Token.RPAREN));
                // Note that we use the recoverset as the set of tokens that can
                // follow the Token.RPAREN - we don't actually know what comes after
                // but we can give it ALL of the possible tokens that could come
                // after + some more (could reduce this set based on context).
                tokens.match(Token.RPAREN, recoverSet); // Can fail
            } else if (tokens.isMatch(Token.NUMBER)) {
                result = new ExpNode.ConstNode(tokens.getLocation(),
                    Predefined.INTEGER_TYPE, tokens.getIntValue());
                tokens.match(Token.NUMBER); // Cannot fail, move pointer past token
            } else if (tokens.isMatch(Token.IDENTIFIER)) {
                result = parseLValue(recoverSet);
            } else {
                // Defensive programming in the case FACTOR_START_SET 
                // has an extra unintended token.
                fatal("parseFactor");
            }
        return result;
    });
}

This code here integrates three key things:
1. Recognising the tokens
2. Building the abstract syntax tree
3. Performing (simple) error recovery.

1.2.2 - Construction of Expression Parse Method

Notice that in the code above, we have used exp.parse. How do we construct this?

The code to construct the expression parser exp
```
ParseMethod<ExpNode> exp = 
    new ParseMethod<>(
        ...
        (Location loc) -> new ExpNode.ErrorNode(loc));
```
- Notice here that the ParseMethod is configured to return an ExpNode.
- The fourth parameter of this function (the anonymous function) must also return an ExpNode.