Week 2.2

1.0 - Backus-Naur Form (BNF) and EBNF

1.1 - Backus-Naur Form (BNF)

1.1.1 - BNF Syntax for Alternatives

1.2 - Extended Backus-Naur Form (EBNF)

Whilst EBNF doesn’t extend the expressivity of BNF, it adds three extra constructs to our BNF form to make it easier to specify our grammars. Since these constructs don’t add any expressivity, we can convert any grammar defined in EBNF into BNF.

  1. An optional syntactic construct, written in square brackets

    RelConditionExp[RelOp Exp] \text{RelCondition}\rightarrow \text{Exp} [\text{RelOp Exp}]
  2. A repetition construct, written in curly brackets

    • The items in the brackets can appear 0 or more times.
    StatementListStatement{SEMICOLON Statement} \text{StatementList}\rightarrow\text{Statement}\{\text{SEMICOLON Statement}\}
  3. A grouping of constructs, written in parentheses

    TermFactor{(TIMES | DIVIDE) Factor} \text{Term}\rightarrow\text{Factor} \{\text{(TIMES | DIVIDE) Factor}\}

1.2.1 - Converting EBNF Constructs to BNF

2.0 - Recursive-Descent Parsing

2.1 - Pattern-Matching a Terminal Symbol

ENBF Construct S Recogniser code recog(S) Notes
T tokens.match(Token.T); Make sure the token is what we want it to be, consume it and move on
IDENTIFIER tokens.match(Token.IDENTIFIER)

2.2 - Pattern-Matching a Non-Terminal Symbol

EBNF Construct S Recogniser code recog(S) Notes
N parseN(); Call the corresponding parse method for the non-terminal symbol.
RelCondition parseRelCondition(); When parsing a RelCondition, we’d call the parseRelCondition(); method to parse it.

2.3 - Matching a Sequence of Terms

So we have these tokens that have been recognised. How do we parse it?

EBNF Construct S Recogniser code recog(S)
S1 S2  SnS_1\ S_2\ \cdots\ S_n recog(S1S_1)
recog(S2S_2)
recog(S3S_3)
LPAREN RelCondition RPAREN tokens.match(Token.LPAREN);
parseRelCondition();
tokens.match(Token.RPAREN);

2.4 - Matching a list of Alternatives

To implement this, we need to look at the current token + the following tokens - it’s not necessarily always possible to write a recursive descent parser that just looks at the current token.

EBNF Construct S Recogniser code recog(S)
S1  S2    SnS_1\ \vert\ S_2\ \vert\ \cdots\ \vert\ S_n if (tokens.isIn(First(S1S_1))) {
    recog(S1S_1)
} else if (tokens.isIn(First(S2S_2))) {
    recog(S2S_2)
} else ...{
\vdots
} else if (tokens.isIn(First(SnS_n))) {
    recog(SnS_n)
} else { /* Omitted if Nullable*/
    errors.error(”Syntax Error”);}

2.4.1 - Matching a List of Alternatives - Example

FactorLPAREN RelCondition RPAREN | NUMBER | LValueFactor\rightarrow\text{LPAREN RelCondition RPAREN | NUMBER | LValue}

void parseFactor(){
  if (tokens.isMatch(Token.LPAREN)) {
    tokens.match(Token.LPAREN);
    parseRelCondition();
    tokens.match(Token.RPAREN);
  } else if (tokens.isMatch(Token.NUMBER)) {
    tokens.match(Token.NUMBER);
  } else if (tokens.isMatch(Token.IDENTIFIER)) {
		parseLValue();
  } else {
    errors.error("Syntax Error");
    /* Or if we have a symbol that's Nullable, assume that we've matched the 
     * Nullable symbol.
     */
  }
}

2.5 - Matching an Optional Construct

EBNF Construct S Recogniser code recog(S)
[S][S] if (tokens.isIn(First(S1S_1)) {
    recog(S1S_1)
} /* Else don’t do anything. */

2.5.1 - Matching an Optional Construct - Example

RelConditionExp[RelOp Exp]\text{RelCondition} \rightarrow\text{Exp[RelOp Exp]}

void parseRelCondition() {
  parseExp(); // Recognise the expression
  if (tokens.isIn(REL_OPS_SET)) { // Can the current token start the RelOp
    parseRelOp();
    parseExp();
  }
}

Where REL_OPS_SET contains the set of terminal symbols representing relational operators.


IfStatementKW_IF Condition KW_THEN Statement [KW_ELSE Statement]\text{IfStatement}\rightarrow\text{KW\_IF Condition KW\_THEN Statement [KW\_ELSE Statement]}

void parseIfStatement() {
  tokens.match(Token.KW_IF);
  parseCondition();
  tokens.match(Token.KW_THEN);
  parseStatement();
  if (tokens.isMatch(Token.KW_ELSE)) {
    tokens.match(Token.KW_ELSE);
    parseStatement();
  }
}

Whilst the grammar doesn’t satisfy the rules required, we can use our parser to resolve that ambiguity (matching else to the closest IF)

2.6 - Matching a Repetition Construct

EBNF Construct S Recogniser code recog(S)
{S}\{S\} while (tokens.isIn(First(S))) {
    recog(S)
}

2.6.1 - Matching a Repetition Construct - Example

TermFactor{(TIMES | DIVIDE) Factor}\text{Term}\rightarrow\text{Factor\{(TIMES | DIVIDE) Factor\}}

void parseTerm(){
  parseFactor();
  while(tokens.isIn(TERMS_OPS_SET)){
		// Step 1: Match either a Tokens.TIMES or Tokens.DIVIDE or throw an error
    if(tokens.isMatch(Tokens.TIMES)){
      tokens.match(Tokens.TIMES);
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Tokens.DIVIDE);
    } else {
      fatal("Internal Error); // Theoretically unreachable
    }
    // Step 2: Match the factor
		parseFactor();
  }
}

2.7 - Summary

3.0 - Parsing and Computing Expressions

Let’s now write a recursive descent parser that not only recognises an expression, but calculates it assuming that there are no syntacticial errors.

  1. Firstly, let’s look at parsing a factor. The production for a Factor\text{Factor} is:

    FactorLPAREN Exp RPAREN | NUMBER\text{Factor}\rightarrow\text{LPAREN Exp RPAREN | NUMBER}

    parseFactor() {
      
      if (tokens.isMatch(Token.LPAREN)) {
        tokens.match(Token.LPAREN);
                 parseExp();
        tokens.match(Token.RPAREN);
      } else if (tokens.isMatch(Token.NUMBER)) {
        
        tokens.match(Token.NUMBER);
      } else {
        errors.error("Syntax error");
      }
    
    }
    
  2. Evaluate the expression and store the result in a local variable.

    parseFactor() {
      int result;
      if (tokens.isMatch(Token.LPAREN)) {
        tokens.match(Token.LPAREN);
        result = parseExp();
        tokens.match(Token.RPAREN);
      } else if (tokens.isMatch(Token.NUMBER)) {
        result = tokens.getIntValue(); // Get the integer value of the current token,
                                       // assuming it's a number
        tokens.match(Token.NUMBER);    // Moves to the next token, so the line before 
                                       // MUST be before
      } else {
        errors.error("Syntax error");
        result = 0x8080808080;         // Set to a random garbage value 
      }
      return result;
    }
    
  3. Next, let’s look at parsing a Term\text{Term}

    TermFactor {(TIMES | DIVIDE) Factor}\text{Term}\rightarrow\text{Factor \{(TIMES | DIVIDE) Factor\}}

    int parseTerm() {
      int result;
      result = parseFactor(); // Result returned if times or divide not in expression
      while (tokens.isIn(TERMS_OPS_SET)) { // Times or divide
        boolean times;
        if (tokens.isMatch(Token.TIMES)) {
          tokens.match(Token.TIMES);
          times = true;
        } else if (tokens.isMatch(Token.DIVIDE)) {
          tokens.match(Token.DIVIDE);
          times = false;
        } else {
          fatal("Internal Error"); // Unreachable.
        }
    		int factor = parseFactor(); // The inner factor
    		if (times) {
          result = result * factor;
        } else {
          result = result / factor;
        }
      }
    	return result;
    }
    
  4. Next, let’s look at parsing a Exp[PLUS | MINUS] Term{(PLUS | MINUS) Term}\text{Exp}\rightarrow\text{[PLUS | MINUS] Term\{(PLUS | MINUS) Term\}}

    int parseExp() {
      boolean isPlus = false; 
      boolean isMinus = false;
    
      if (tokens.isMatch(Token.PLUS)) {
        tokens.match(Token.PLUS);
        isPlus = true;
      } else if (tokens.isMatch(Token.MINUS)){
        tokens.match(Token.MINUS);
        isMinus = true;
      }
    
      int term = parseTerm();
      parseExp();
    }
    

4.0 - Building an Abstract Syntax Tree (AST)

4.1 - Building an Abstract Syntax Tree (AST) - Parsing Example

We now want our parseTerm method to return an ExpNode (Expression Node)

ExpNode parseTerm() {
  ExpNode result = parseFactor();
  while (tokens.isIn(TERM_OPS_SET)) {
    Operator op = Operator.INVALID_OP;
    Location opLoc = tokens.getLocation(); // Location of times or divide. Used in AST.
    if (tokens.isMatch(Token.TIMES)) {
      tokens.match(Token.TIMES);
      op = Operator.MUL_OP;
    } else if (tokens.isMatch(Token.DIVIDE)) {
      tokens.match(Token.DIVIDE);
      op = Operator.DIV_OP;
    } else {
      fatal("Internal error"); // Unreachable
    }
    ExpNode right = parseFactor();
    result = new ExpNode.BinaryNode(opLoc, op, result, right);
  }
  return result;
}

4.2 - Building an AST - Visual Example

Let’s build the AST for 23/42*3/4

  1. At the start, our current token points to 22.

    • So parseFactor() will return an AST for 22
    • Note that l1l_1 denotes Location 1 in this example but should be replaced by a line number and column number.
  2. Our current token now points to *

    • We recognise that it is either a TIMES or DIVIDE - and so we parse the sequence {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}.
    • We know that the location for the start of this sequence is Location 2
    • We also know that the operation is a multiplication operation (MUL_OP)
  3. Our current token now points to 33

    • We now parse the Factor\text{Factor} component of the sequence {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}

    • We now have the LHS and RHS components of our BinaryOperator, so we can add pointers to them to create the AST for the binary operator.

  4. Our current token now points to // (the divide operator)

    • Our current token is a divide symbol, so we can’t terminate the loop - we have to read in another occurrence of {(TIMES | DIVIDE) Factor}\text{\{(TIMES | DIVIDE) Factor\}}
    • From the information at hand, we know that the operator is located at Location 4 (l4l_4) and that it’s a division operator DIV_OP
    • In reading in the BinaryOperator token, we move our current token pointer to the factor 44.
    • As before, we parse it
  5. Our current pointer now points to nothing / null

    • As before, we put together our Abstract Syntax Tree
  6. We can terminate as the pointer is no longer pointing at a MUL_OP or DIV_OP

4.3 - Recursive Descent Parser So Far...