Lecture 1
1.0 - Phases of a Compiler
These notes are from Phases of a Compiler - Ian Hayes 2019
- The image below shows the phases of compiler (blue boxes) and the input/output of each phase (white boxes).
- The first input to the compiler is the source code which is treated as a text file or sequence of characters.

1.1 - A Note
-
The steps of a compiler goes through can be broken down into analysis and synthesis
-
In the analysis phase, the compiler breaks up the source program into constituent pieces (tokens) and imposes a grammatical structure on them.
- Uses the tokens to create an intermediate representation of the source program (in the form of a syntax tree)
- If a syntactic error is detected in the source program, the compiler must provide informative messages to the user
-
In the synthesis phase, the compiler generates the program for the target machine using the intermediate representation (syntax tree) and the information contained in the symbol table.
1.2 - Lexical Analysis
Lexical Analysis is the step in the compiler where the source code is converted into
lexical tokens
- Input A sequence of characters representing a program
- Output Sequence of lexical tokens
- The lexical analyser ignores whitespace, blanks, tabs, newlines, carriage returns, form feeds
- Whitespace is still important, as it is used to delimit lexical tokens.
- Comments are treated as whitespace
1.2.1 - Lexical Analysis Example
-
Suppose we have the following input (or sequence of characters)
if x < 0 then z := -x else: z := x
-
The output of the lexical analysis would be
KW_IF, IDENTIFIER("x"), LESS, NUMBER(0), KW_THEN, IDENTIFIER("z"), ASSIGN, MINUS, IDENTIFIER("x"), KW_ELSE, IDENTIFIER("z"), ASSIGN, IDENTIFIER("x")
-
The grammar of the language can be used to extract the structure of the program statement from its sequence of lexical tokens.
-
The structure can be described using a concrete syntax tree.
1.3 - Syntax Analysis
Given a series of lexical tokens, generate an abstract syntax tree and a symbol table.
- Input Series of lexical tokens generated by the lexical analysis phase.
- Output An Abstract Syntax Tree (AST) and symbol table.
- The symbol table contains important information about all of the identifiers that are defined in the program (+ a few predefined ones)
- May be organised into scopes (e.g. identifiers within a procedure; may be organised hierarchically)
- The syntax tree is an intermediate representation of the target program that depicts the grammatical structure of the text input (source code)
- In this phase, we also identify any syntactical errors in the source code.
1.4 - Type Checking (Static Semantic Analysis / Static Analysis)
Check for type and other static semantic errors
- Input Symbol Table and Abstract Syntax Tree (AST)
- Output Updated Symbol Table and Abstract Syntax Tree (AST)
- Resolves all references to identifiers
- Updates the symbol table entries with type information for each identifier
- Checks abstract syntax tree for type correctness; type checking is performed
- e.g. the compiler may check if the value for an array index is an integer, and will report an error if a floating-point number is used as an index into an array.
- Updates abstract syntax tree with type coercions
A binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert / coerce the integer into a floating-point number.
1.5 - Code Generation
Given the updated Symbol Table and Abstract Syntax Tree, generate code for the target machine.
- Input: Symbol Table and Updated Abstract Syntax Tree (AST)
- Output Code for the target machine
- The code generation phase may include:
- Machine-independent optimisation
- Machine-dependent optimisations
- Instruction selection
- Register allocation
- At this point, we know that the syntax and static semantics of our program (source code) is correct.
2.0 - Compilers vs Interpreters
What is the difference between compilers and interpreters? Interpreters are essentially the same as compilers, except they don’t perform the last step of Code Generation.
- Compilers output programs compiled for a target machine
2.1 - About Interpreters
Interprets the abstract syntax tree directly to execute the program
- Input: A symbol table and updated Abstract Syntax Tree (AST)
- Less time spent on the compilation step (as compared to a Compiler) as no code generated
- Slower to execute the program (as code generation is done statement by statement as required)
- More commonly used for high-level dynamically typed languages
- Can typically give better error diagnostics than a compiler because it executes the statements in the source program sequentially.
3.0 - Syntax of PL0
PL0 is an educational language, used for teaching Compilers and Interpreters. It is a subset of the Pascal language.
-
There are two default types in Pascal - booleans and integers
var b: boolean; // Boolean r: int; // Integer
-
We can also define our own types:
const C = 42; // A constant integer type S = [0..C]; var x: S; // Subrange type
-
ln PL0, semicolons (;) are used as statement separators not statement terminators.
begin i := 0; res := 0; // Other statements here i := i + 1 // This statement doesn't need a semicolon as there no statements after end
-
When defining a while loop, if there are multiple statements in its body, it gets wrapped in a
begin
andend
clause.while i != x do res += res + i; end
// For multiple statements; while i != x do begin res := res + i; i := i + 1 end end
-
Putting it all together, we get the following program:
const C = 42; // in PL0 we can define constants type // We can also define our own variable types S = [0..C]; // Subrange 0 to 42 var b: boolean; // Boolean variable r: int; // Integer variable x: S; // Subrange type (defined above) res: int; // Another integer procedure sum() = // Procedures start out with declarations var i: S; // Local variable to sum res: int; // Masks the scope of other `res` (if we choose to define res again here) begin i := 0; res := 0; while i != x do begin res := res + i; i := i + 1 end // while loop contents end // while loop begin // main read r; b := ( r <= C); // b is a boolean if b then x := r else x := 0; call sum(); write res end