Component 1: Lexer and Parser

In this component , you will begin your partial compiler, thc, with a lexer, parser, and pretty-printer for ILOC code.

thc will receive as input an ILOC file, and print either the stream of tokens, the intermediate representation, or ILOC code depending on the flags. You will perform some basic context-free sanity-checking of the ILOC code. There are two components: a lexer and a parser. The lexer will read in the text file and produce a stream of tokens. The parser will take the stream of tokens and create an intermediate representation of the code.

Before submitting your program, test it thoroughly on all provided blocks.

Lexer

Because ILOC is exceedingly simple to parse, a significant focus of this component is on the lexer, also called a scanner. You MUST hand-code your lexer, reading a single character at a time. You MUST NOT use regular expressions, scanf, or other programming constructs that process more than one character at a time. The hand-coded approach from Chapter 2.5 in Engineering a Compiler is suggested, but the table-driven approach may be used (although this is ambitious).

Input: ILOC Subset

Your input is a single basic block of code consisting of instructions from the following subset of ILOC. All letters are case-sensitive. The following figure describes the syntax and semantics of ILOC instructions.

Syntax Meaning Latency
load r1 => r2 r2 ← MEM (r1) 2
loadI x => r2 r2 ← x 1
store r1 => r2 MEM(r2) ← r1 2
add r1, r2 => r3 r3 ← r1 + r2 1
sub r1, r2 => r3 r3 ← r1 - r2 1
mult r1, r2 => r3 r3 ← r1 * r2 1
lshift r1, r2 => r3 r3 ← r1 ≪ r2 1
rshift r1, r2 => r3 r3 ← r1 ≫ r2 1
output x prints MEM(x) to stdout 1
nop idle for one cycle 1

All register names have an initial r followed immediately by a non-negative integer. Leading zeros in the register name are not significant; thus, r017 and r17 MUST refer to the same register. Arguments that do not begin with r, which appear as x in the table above, are assumed to be positive integer constants in the range 0 to 231 -1.

All ILOC operations in input blocks must begin on a new line. The code will contain commas and the assignment arrow =>. Spaces and tabs are treated as whitespace. ILOC opcodes MUST be followed by whitespace, which MAY be a newline in the case of nop. Whitespace preceding and following all other symbols is optional. Whitespace MUST NOT occur within operation names, register names, or the assignment symbol. All other whitespace MUST be ignored. A double slash (//) indicates that the rest of the line is a comment and MUST be discarded. Empty lines MUST also be discarded.

If you are curious, Appendix A of Engineering a Compiler provides details on these instructions, with the exception of the output operation. The output operation prints a value from memory to stdout.

Lexer Implementation and Output

The lexer MUST turn the input into a stream of tokens. You MAY either represent the stream explicitly, as collection of tokens; or implicitly, using functions that return the next token. You MUST implement code to lex an arbitrary token, but you MAY find it helpful to also have specific functions that lex a token of each category.

Each token must have a token category and, when appropriate, a value. There are five categories of tokens in ILOC: instructions, registers, constants, commas, and arrows. The state diagrams for instructions and registers are included in the Appendix. Token categories, instruction opcodes, and any other non-numeric values MUST be represented with an enumerated type. You MUST NOT use strings or raw integers. You SHOULD convert integer lexemes into integer values upon scanning. As described below, if the -l flag is used, you MUST output each token on a line. Include both the category of the token, as well as either it's value or lexeme.

Parser

The parser MUST use the lexer to build an intermediate representation, as discussed below and in class. You MUST hand-code your parser, but because ILOC is simple to parse, you do not need to specifically implement an LL or LR parser. In fact, we will likely not even have covered that topic in class. The first token of every line will be an opcode, which fully specifies the remainder of the line. Each line MUST be encoded as a single entry in the intermediate representation. This greatly simplifies the task of parsing. Lab 2 delves into the details of parsing, the complexity here lies in the intermediate representation.

Intermediate Representation

Parsing alone will simply tell us if the input is well-formed. Working on the code requires a concrete structure: the intermediate representation (IR). Future components will build on the parser, manipulating the IR to perform register allocation and instruction scheduling. Thus you SHOULD design your IR to support these components. Some of the requirements are outlined below, but many will not be revealed until you are enmeshed in the details of the project. This means that extensibility is a central design goal.

The intermediate representation is a standardized way to represent ILOC code. You MUST use a linear IR: a sequence of instructions represented in a uniform way. Future components will require you to insert and reorder instructions, so you SHOULD ensure the data structure used to represent the sequence supports these operations efficiently.

To represent each instruction, you MUST with use three-address codes, similar to Chapter 5.3.2 in Engineering a Compiler. This represents each instruction as quadruple of the opcode, two source operands, and one destination operand. Some instructions lack two source operands or a destination operand, but MUST still be represented in a consistent manner. Future components will require the ability to annotate both instructions and the operands. Using structs will simplify extending the IR with these annotations. By contrast, using arrays makes adding new fields error-prone. You SHOULD use a single struct or type to represent an single instruction, and a sub-type to represent operands.

Take special care with the store operation: store r1 => r2. This operation takes the value in r1, and place it into the memory address specified by r2. Consequently, the value in r2 is read, not overwritten like every other destination operand. You SHOULD treat r2 as a source operand in your intermediate representation. You MUST print the store instruction correctly.

Component Details

Interface

Your code MUST compile on the janus machines, using a Makefile in the project directory, to an executable named thc in that directory. Your program MUST implement the following command line arguments when specified by the short flags. You SHOULD support the long flags as well. You SHOULD use a library, such as getopt, to handle these flags. Your default output for this version SHOULD be the -t flag:.

-l --lexer Print a list of the tokens, including category and lexeme, one token per line.
-p --pretty-print Print legal ILOC code. You do not have to preserve white-space or comments.
-t --table-print Print the intermediate representation in tabular form.
-h --help Print a help summary and exit.
-d n --debug =n Print debugging information during execution. All debugging information SHOULD be printed as a comment (//).

Unless the -h flag is used, the final argument MUST be filename, an input file containing an ILOC block. For the tabular output, you SHOULD format the output into well-spaced columns, using separators such as | between elements of your IR. All outputs MUST be printed to stdout. Error messages for invalid arguments or an invalid file SHOULD be reasonable and MUST be printed to stderr. You MAY forbid the combination of -l with other print flags.

You may include other command-line arguments or flags for your program, but SHOULD leave the following flags free: -a, -k, and -s.

Infrastructure

The workspace repository has a directory blocks that contains a a number of sample blocks. Each block has a comment with a sample input and expected output when simulated (see below). It also contains a folder thcFrontEnd with the following useful elements.

The workspace also includes a few tools in the thcFrontEnd directory

outputs/ that contains sample outputs for block0x.i.
reader_ref: A reference reader you can compare your code with.

sim: A simulator. Running ./sim -h will display details, but the usual invocation is ./sim < file or ./thc -p file | sim.
The -i nums flag lets you initialize memory. The first number is the starting location, the every number after that populates 4 bytes. Use this to verify your -p output.
scripts/testReaderIloc and scripts/genReaderOutputs that help to generate and/or test outputs for an entire folder of blocks.

Appendix: State Diagrams

This appendix contains state diagrams that accept opcodes and registers. These may be of assistance when implementing your scanner.

Report abuse