Aho - Chapter 1.1 to 1.2 Summary
Sections
1.1 - Language Processors
1.2 - The Structure of a Compiler
1.1 - Language Processors
What is the difference between a compiler and an interpreter?
- Before a program can be run, it must be translated into a form which can be executed by a computer - this is done by a compiler.
- A compiler is a program that can read another program in one language (source language) and translate it to another language (target language)

- A compiler also detects and reports errors in the source program during the translation process.
- If we translate our source program into an executable machine language program, it (the target program) can then be called by the user to process inputs and produce outputs.
- An interpreter is another common type of language processor which directly executes the operations specified in the source program on inputs supplied by the user.

- A machine-language target program produced by a compiler is typically much faster than an interpreter at mapping inputs to outputs
- Interpreter can usually give better error diagnostics than a compiler because it executes the statements in the source program sequentially.
1.1.1 - The Java Compiler (Hybrid Compiler)
- The Java language processors combine compilation and interpretation
- A Java source program may first be compiled into an intermediate form called bytecodes, which are then interpreted on a virtual machine
- Some Java compilers use just-in-time compilation, translating bytecodes into machine language immediately before they run the intermediate program to process the input.

1.1.3 - The Preprocessor
- In addition to a compiler, several other programs may be required to create an executable target program.

- For example, a source program may be divided into modules stored in separate files - the task of collecting (combining) the source program is sometimes entrusted to a separate program called a preprocessor
- The pre-processor may also expand shorthands (macros) into source-language statements.
- The modified source program is then fed into a compiler (which may produce an assembly-language program as its output, since assembly is easier to produce and debug)
- The assembly-language program is then processed by an assembler, which produces relocatable machine code as its output.
- Large programs are often compiled in pieces, so relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine
- The linker program resolves the external memory addresses, where the code in one file may refer to a location in another file
- The loader then puts together all of the executable object files into memory for execution.
1.1.4 - Exercises for Chapter 1.1
- What is the difference between a compiler and an interpreter?
- A compiler is a program that translates an entire program (from a source language to a target language) before it is run. An interpreter instead directly executes the operations specified in the source program, statement by statement, on user-supplied inputs.
- What are the advantages of a compiler over an interpreter?
- A program produced by a compiler is typically faster to execute (i.e. faster at mapping inputs to outputs)
- What are the advantages of an interpreter over a compiler?
- An interpreter can usually give better error diagnostics, as it translates (and ultimately executes) the statements of the source program sequentially.
- A compiler that translates a high-level language into another high-level language is called a source-to-source translator. What advantages are there to using C as a target language for a compiler?
- C is a lower-level language, and compilers from C to many target machines already exist
- You don't have to write a compiler all the way down to machine code; you can leverage existing C compilers
- You just have to focus on the compiler from your language to C.
- Describe some of the tasks that an assembler needs to perform
- Translate assembly-language mnemonics into machine code, converting symbolic names into machine addresses (Fairhurst, G)
1.2 - The Structure of a Compiler
- The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them
- It then uses this structure to create an intermediate representation of the source program
- If the analysis part detects that the source program is syntactically ill formed or semantically unsound, it must provide informative messages so that the user can take corrective action
- Analysis also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part
- The analysis part is often called the front-end of the compiler
- The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.
- The synthesis part is often called the back-end of the compiler.

- In practice, several phases may be grouped together and the intermediate representations between the grouped phases don’t need to be constructed explicitly.
- The symbol table which stores information about the entire source program is used by all phases of the compiler.
- Some compilers have a machine-independent optimisation phase between the front-end and the back-end, to produce a better target program than would otherwise be produced from an unoptimised intermediate representation
- Since optimisation is optional, the optimisation phases shown in Figure 1.6 may be missing.
1.2.1 - Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning.
- The lexical analyser reads the source program as a stream of characters and groups the characters into meaningful sequences called lexemes.
- For each lexeme, the lexical analyser produces as output a token of the form ⟨token-name, attribute-value⟩
- These tokens are then passed on to the next phase - syntax analysis.
- token-name: an abstract symbol that is used during syntax analysis.
- attribute-value: a pointer to an entry in the symbol table for this token
- Information from the symbol-table entry is needed for semantic analysis and code generation
Lexical Analysis Example
Suppose the source program contains the following assignment statement
position = initial + rate * 60
- The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens, passed on to the syntax analyser:
- position is a lexeme that would be mapped into the token ⟨id, 1⟩, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.
- The symbol-table entry for an identifier holds information about the identifier, such as its name and type
- = (the assignment symbol) is a lexeme that is mapped into the token ⟨=⟩. Since this token needs no attribute value, the second component has been omitted. We could have used any abstract symbol such as assign, but for notational convenience we use the lexeme itself as the name of the abstract symbol.
- initial is a lexeme that is mapped into the token ⟨id, 2⟩, where 2 points to the symbol-table entry for initial.
- + is a lexeme that is mapped into the token ⟨+⟩
- rate is a lexeme that is mapped into the token ⟨id, 3⟩, where 3 points to the symbol-table entry for rate
- * is a lexeme that is mapped into the token ⟨*⟩
- 60 is a lexeme that is mapped into the token ⟨60⟩ (technically speaking, for the lexeme 60 we should generate a token like ⟨number, 4⟩, where 4 points to the symbol-table entry for the internal representation of the integer 60, but that will be discussed in a later section.)
- Finally, the sequence of tokens generated would be:
⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨*⟩ ⟨60⟩
- Note that in this representation, the token names =, + and * are abstract symbols for the assignment, addition and multiplication operators respectively.
- Also note that the blank spaces separating the lexemes would be discarded by the lexical analyser.
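The character-grouping described above can be sketched as a tiny lexer. This is an illustrative assumption, not the book's code: identifiers are interned into a symbol table and each lexeme becomes a (token-name, attribute-value) tuple, with the attribute omitted where none is needed.

```python
# A minimal lexer sketch (illustrative, not the book's implementation).
# Identifiers become ("id", n) tokens, where n is a 1-based symbol-table
# entry; operators and numbers carry no attribute value here.
import re

def lex(source):
    symtab = []   # symbol table: one record (here, just the name) per identifier
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        else:
            tokens.append((lexeme,))   # operator or number: no attribute value
    return tokens, symtab

tokens, symtab = lex("position = initial + rate * 60")
# tokens[0] is ("id", 1), pointing at the symbol-table entry for position
```

Note how the blank spaces in the input are discarded: the regular expression only matches lexemes, never whitespace.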
1.2.2 - Syntax Analysis
The second phase of a compiler is syntax analysis (parsing), which uses the first components of the tokens generated by lexical analysis to produce an intermediate representation (a syntax tree)


- The parser uses the first components of the tokens produced by the lexical analyser to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
- Typically represented by a syntax tree
- The syntax tree makes the operations and inputs explicit.
- The node * indicates that we must multiply the value of rate by the integer 60.
- The node + indicates that we must add the result of the multiplication to the value of initial.
- The root node of the tree, labelled =, indicates that we must store the result of this addition into the location for the identifier position.
- This ordering of operations is consistent with the usual conventions of arithmetic, which tell us that multiplication has higher precedence than addition (and hence the multiplication is performed before the addition)
- The subsequent phases of the compiler use the grammatical structure to help analyse the source program and generate the target program.
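The tree just described can be sketched with nested (operator, left, right) tuples. The evaluate walker below is a hypothetical helper, added only to show that the tree's shape encodes precedence: the multiplication node sits below the addition node, so it is computed first.

```python
# Syntax tree for: position = initial + rate * 60
# (nested tuples as nodes; an illustrative encoding, not the book's)
tree = ("=", "position",
        ("+", "initial",
              ("*", "rate", 60)))

def evaluate(node, env):
    """Bottom-up walk of the tree; hypothetical helper for illustration."""
    if isinstance(node, tuple):
        op, left, right = node
        if op == "=":
            env[left] = evaluate(right, env)   # store into the identifier's slot
            return env[left]
        lhs, rhs = evaluate(left, env), evaluate(right, env)
        return lhs + rhs if op == "+" else lhs * rhs
    return env.get(node, node)                 # identifier lookup, or a literal

env = {"initial": 10, "rate": 2}
evaluate(tree, env)
# env["position"] is now 10 + 2 * 60 = 130, respecting precedence
```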
1.2.3 - Semantic Analysis
The Semantic Analyser checks the source program for semantic consistency and gathers type information for subsequent use (& performs type checking).
- Uses the syntax tree and symbol table to check the source program for semantic consistency with the language definition
- Gathers type information and saves it in the syntax tree or symbol table for subsequent use during intermediate code generation
- Type checking is also performed in the semantic analysis phase: the compiler checks that each operator has matching operands
- E.g. the compiler may check if the value for an array index is an integer, and will report an error if a floating-point number is used as an index into an array
- A language specification may permit some type conversions called coercions
A binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert / coerce the integer into a floating-point number.

- Note that in the syntax tree above, the variables in our symbol table may be defined as floating-point numbers
- In this case, we add an extra node for the operator inttofloat, which explicitly converts an integer (in this case 60) into a floating-point number.
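A sketch of how a semantic analyser might insert such a coercion node. The type rules and node shapes here are assumptions for illustration, not the book's algorithm: when a binary operator mixes a float and an int, the int operand is wrapped in an explicit inttofloat node.

```python
# Wrap the int side of a mixed float/int operation in an ("inttofloat", ...)
# node (illustrative type rules, not the book's algorithm).
def coerce(op, left, right, typeof):
    lt, rt = typeof(left), typeof(right)
    if lt == "float" and rt == "int":
        right = ("inttofloat", right)
    elif lt == "int" and rt == "float":
        left = ("inttofloat", left)
    return (op, left, right)

# Assumed symbol-table types: rate is a float, literals are ints.
types = {"rate": "float"}
typeof = lambda n: types.get(n, "int") if isinstance(n, str) else "int"

node = coerce("*", "rate", 60, typeof)
# node == ("*", "rate", ("inttofloat", 60)) - the conversion is now explicit
```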
1.2.4 - Intermediate Code Generation
When translating source code into target code, a compiler may construct one or more intermediate representations (which can have a variety of forms) e.g. syntax trees
- After syntax and semantic analysis of a source program, many compilers generate an explicit low-level or machine-like intermediate representation (which we can think of as a program for an abstract machine)
- This intermediate representation should have two key properties:
- It should be easy to produce
- It should be easy to translate into the target machine.
- This intermediate representation should have two key properties:
- Here there’s some discussion about three-address code in the textbook, but I have omitted it as I think it’s irrelevant.
- Simply, three-address code is a sequence of assembly-like instructions with at most three operands (addresses) per instruction and at most one operator on the right-hand side, in which each operand can act like a register.
- Translating the statement from the section above gives the following three-address code:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
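Code of this shape can be produced by a post-order traversal of the syntax tree, emitting one instruction per operator and inventing a fresh temporary for each result. The generator below is an illustrative sketch; the node encoding and temporary naming are assumptions.

```python
import itertools

def gen_tac(node, code, temps=None):
    """Post-order walk emitting three-address instructions into `code`;
    returns the name holding this node's value. Illustrative sketch."""
    temps = temps or itertools.count(1)
    if not isinstance(node, tuple):
        return str(node)                       # identifier or literal
    if node[0] == "inttofloat":                # unary coercion node
        arg = gen_tac(node[1], code, temps)
        t = f"t{next(temps)}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen_tac(left, code, temps)
    r = gen_tac(right, code, temps)
    if op == "=":
        code.append(f"{l} = {r}")              # store into the identifier
        return l
    t = f"t{next(temps)}"                      # fresh temporary per operator
    code.append(f"{t} = {l} {op} {r}")
    return t

tree = ("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 60))))
code = []
gen_tac(tree, code)
# code now holds the four instructions listed above, t1..t3 in order
```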
1.2.5 - Code Optimization
Goal: Improve the intermediate code so that better target code is generated. Typically “better” means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.
- A straightforward algorithm may generate intermediate code using an instruction for each operator in the tree representation that comes from the semantic analyser.
- A simple intermediate code generation algorithm followed by code optimisation is a reasonable way to generate good target code.
- The optimiser can deduce that the conversion of 60 from an integer to a floating-point number can be done once, at compile time.
- Additionally, t3 is used only once, to transmit its value to id1.
- With these improvements, the optimiser can transform the initial code into the following optimised code:

{ initial code }
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

{ optimised code }
t1 = id3 * 60.0
id1 = id2 + t1

- There is great variation in the amount of code optimisation that different compilers perform
- In the compilers that do the most optimisation (i.e. "optimising compilers"), a significant amount of time is spent on this phase.
- On the other hand, there are simple optimisations that can significantly improve the running time of the target program without slowing down compilation too much.
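The two improvements above can be sketched as a small pass over a tuple form of the three-address code. This is an illustrative toy optimiser, not a production one: it folds inttofloat of a constant at compile time, then merges a temporary whose only use is the final copy.

```python
def optimise(code):
    """Constant-fold inttofloat(c) and propagate the result, then
    eliminate a trailing copy of a just-computed temporary. Toy pass."""
    consts, out = {}, []
    for dest, expr in code:
        if expr[0] == "inttofloat" and isinstance(expr[1], int):
            consts[dest] = float(expr[1])      # done once, at compile time
            continue
        expr = tuple(consts.get(e, e) for e in expr)  # substitute folded values
        out.append((dest, expr))
    # copy elimination: "id1 = t3" where t3 was computed just before
    if len(out) >= 2 and out[-1][1][0] == "copy" and out[-1][1][1] == out[-2][0]:
        out[-2:] = [(out[-1][0], out[-2][1])]
    return out

code = [
    ("t1", ("inttofloat", 60)),
    ("t2", ("*", "id3", "t1")),
    ("t3", ("+", "id2", "t2")),
    ("id1", ("copy", "t3")),
]
optimise(code)
# → [("t2", ("*", "id3", 60.0)), ("id1", ("+", "id2", "t2"))]
```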
1.2.6 - Code Generation
Generate code in the target language based on the intermediate representation
- The code generator takes the intermediate representation of the source program and maps it into the target language
- If the target language is machine code, registers or memory locations are selected for each of the variables used by the program
- Then, intermediate instructions are translated into sequences of machine instructions that perform the same task
- Assigning registers to hold variables must be done carefully.
- We can translate the optimised three-address code from before into the following machine code:
{ optimised three-address code}
t1 = id3 * 60.0
id1 = id2 + t1
{ the corresponding machine code}
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
LDF R2, id3 : loads the contents of address id3 into register R2
MULF R2, R2, #60.0 : multiplies the value in R2 by the floating-point constant 60.0
- The preceding # indicates that 60.0 is to be treated as an immediate constant.
LDF R1, id2 : loads id2 into register R1
ADDF R1, R1, R2 : adds the value stored in R2 (rate * 60.0) to the value of id2 in R1
STF id1, R1 : the value in register R1 is stored into the address of id1
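The mapping for this example can be sketched as follows. The mnemonics mirror the LDF/MULF/ADDF/STF sequence above, but the register "allocation" is hard-wired for these two instructions - an assumption for illustration, not a real allocator.

```python
def codegen(tac):
    """Map (dest, op, a, b) three-address instructions to load/compute/
    store sequences. Register choice is hard-wired for this example."""
    out, reg = [], {}
    free = iter(["R2", "R1"])                  # just enough registers here
    for dest, op, a, b in tac:
        r = next(free)
        out.append(f"LDF {r}, {a}")            # load left operand from memory
        rhs = reg.get(b, f"#{b}")              # temp already in a register, else immediate
        mnem = "MULF" if op == "*" else "ADDF"
        out.append(f"{mnem} {r}, {r}, {rhs}")
        if dest.startswith("t"):
            reg[dest] = r                      # temporaries stay in registers
        else:
            out.append(f"STF {dest}, {r}")     # named variables go back to memory
    return out

machine = codegen([("t1", "*", "id3", 60.0), ("id1", "+", "id2", "t1")])
# machine matches the five instructions shown above
```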
1.2.7 - Symbol Table
The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
- An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.
- These attributes may provide information such as:
- Info about the storage allocated for a name
- Its type
- Its scope (where in the program its value may be used)
- In the case of procedure names, information such as the number and types of its arguments, the method of passing each argument (by value, by reference), and type returned.
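A minimal sketch of such a table: a dict keyed by name gives the fast store/retrieve the text asks for, and each record holds attributes like those listed above. The field names (type, scope, params, returns) are assumptions for illustration.

```python
# Symbol table as a dict of per-name attribute records (illustrative sketch).
symbol_table = {}

def declare(name, **attrs):
    """One record per name; dict lookup makes find/store/retrieve fast.
    Attribute names here are illustrative assumptions."""
    symbol_table[name] = attrs

declare("position", type="float", scope="global")
declare("rate", type="float", scope="global")
declare("compute", type="procedure", returns="float",
        params=[("x", "float", "by value")])   # name, type, passing method

symbol_table["rate"]["type"]   # fast retrieval of an attribute
```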
1.2.8 - Compiler Phases into Passes
Can group several compiler phases into a single pass.
- The front-end phases of a compiler (lexical analysis, syntax analysis, semantic analysis, intermediate code generation) might be grouped together into a single pass
- There may be an optional middle pass for code optimisation
- The back-end pass could consist of the code generation phase for a specific target machine.
- In this way, we could substitute:
- Different back-end passes for different target machines from the same source language
- Different front-end passes for different source languages to the same target machine
1.2.9 - Compiler Construction Tools
Tools (and their respective descriptions) that can be used to aid the construction of a compiler.
-
The compiler writer can profitably use modern software development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses etc.
-
In addition to these general software development tools, other more specialised tools have been created to help implement the various phases of a compiler.
-
These tools use specialised languages for specifying and implementing specific components, and many use quite sophisticated algorithms
- The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of the compiler.
-
Some commonly used compiler construction tools include:
- Parser Generators that automatically produce syntax analysers from a grammatical description of a programming language
- Scanner Generators that produce lexical analysers from regular expression description of the tokens of a language
- Syntax-Directed Translation Engines that produce collections of routines for walking a parse tree and generating intermediate code
- Code-Generator Generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine
- Data-Flow Analysis Engines that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. Data flow analysis is a key part of code optimisation
- Compiler-Construction Toolkits that provide an integrated set of routines for constructing various phases of a compiler.