Project 1: Scanner and Parser

In this project, you will extend tipc with a scanner and parser for ILOC code. This is the only project that will break backwards-compatability: you will not be able to read in Project 0 files after finishing this project. Thus, be sure to tag your Project 0 submission with p0 before committing anything for this project.

tipc will receive as input an ILOC file, and print either the stream of tokens, the intermediate representation, or ILOC code depending on the flags. You will perform some basic context-free sanity-checking of the ILOC code. There are two components: a scanner and a parser. The scanner will read in the text file and produce a stream of tokens. The parser will take the stream of tokens and create an intermediate representation of the code.

Before submitting your program, test it thoroughly on all provided blocks.

Due Dates:

Checkpoint - Scanner and token printer due Thursday 9/7
Project - Parser and printer due Thursday 9/14

Scanner

Because ILOC is exceedingly simple to parse, a significant focus of this projects on the scanner. You MUST hand-code your scanner, reading a single character at a time. You MUST NOT use regular expressions, scanf, or other programming constructs that process more than one character at a time. The hand-coded approach from Chapter 2.5 in Engineering a Compiler is suggested, but the table-driven approach may be used (although this is ambitious).

Input: ILOC Subset

Your input is a single basic block of code consisting of instructions from the following subset of ILOC. All letters are case-sensitive. The following figure describes the syntax and semantics of ILOC instructions.

Syntax Meaning Latency
load r1 => r2 r2 ← MEM (r1) 2
loadI x => r2 r2 ← x 1
store r1 => r2 MEM(r2) ← r1 2
add r1, r2 => r3 r3 ← r1 + r2 1
sub r1, r2 => r3 r3 ← r1 - r2 1
mult r1, r2 => r3 r3 ← r1 * r2 1
lshift r1, r2 => r3 r3 ← r1 ≪ r2 1
rshift r1, r2 => r3 r3 ← r1 ≫ r2 1
output x prints MEM(x) to stdout 1
nop idle for one cycle 1

All register names have an initial r followed immediately by a non-negative integer. Leading zeros in the register name are not significant; thus, r017 and r17 MUST refer to the same register. Arguments that do not begin with r, which appear as x in the table above, are assumed to be positive integer constants in the range 0 to 231 -1.

All ILOC operations in input blocks must begin on a new line. The code will contain commas and the assignment arrow =>. Spaces and tabs are treated as whitespace. ILOC opcodes MUST be followed by whitespace, which MAY be a newline in the case of nop. Whitespace preceding and following all other symbols is optional. Whitespace MUST NOT occur within operation names, register names, or the assignment symbol. All other whitespace MUST be ignored. A double slash (//) indicates that the rest of the line is a comment and MUST be discarded. Empty lines MUST also be discarded.

If you are curious, Appendix A of Engineering a Compiler provides details on these instructions, with the exception of the output operation. The output operation prints a value from memory to stdout.

Scanner Implementation and Output

The scanner MUST turn the input into a stream of tokens. You may either represent the stream explicitly, as collection of tokens; or implicitly, using functions that return the next token. You MUST implement code to lex an arbitrary token, but you MAY find it helpful to also have specific functions that lex a token of each category.

Each token must have a token category and, when appropriate, a value. There are five categories of tokens in ILOC: instructions, registers, constants, commas, and arrows. The state diagrams for instructions and registers are included in the Appendix. Token categories, instruction opcodes, and any other non-numeric values MUST be represented with an enumerated type. You MUST NOT use strings or raw integers. You SHOULD convert integer lexemes into integer values upon scanning. As described below, if the -l flag is used, you MUST output each token on a line. Include both the category of the token, as well as either it's value or lexeme.

Parser

The parser MUST use the scanner to build an intermediate representation, as discussed below and in class. You MUST hand-code your parser, but because ILOC is simple to parse, you do not need to specifically implement an LL or LR parser. In fact, we will likely not even have covered that topic in class. The first token of every line will be an opcode, which fully specifies the remainder of the line. Each line MUST be encoded as a single entry in the intermediate representation. This greatly simplifies the task of parsing. Project 3 delves into the details of parsing, the complexity here lies in the intermediate representation.

Intermediate Representation

Parsing alone will simply tell us if the input is well-formed. Working on the code requires a concrete structure: the intermediate representation (IR). Projects 2 and 4 will build on the parser, manipulating the IR to perform register allocation and instruction scheduling. Thus you SHOULD design your IR to support these projects. Some of the requirements are outlined below, but many will not be revealed until you are enmeshed in the details of the project. This means that extensibility is a central design goal.

The intermediate representation is a standardized way to represent ILOC code. You MUST use a linear IR with three-address codes, similar to Chapter 5.3.2 in Engineering a Compiler. This IR is a sequence of instructions represented in a uniform way. Future projects will require you to insert and reorder instructions, so you SHOULD ensure the data structure used to represent the sequence supports these operations efficiently.

Each instruction can be represented as quadruple of the opcode, two source operands, and one destination operand. Some instructions lack two source operands or a destination operand, but MUST still be represented in a consistent manner. Future project will require the ability to annotate both instructions and the operands. Using classes or structs will simplify extending the IR with these annotations. By contrast, using arrays makes adding new fields error-prone. Consider a single class, struct, or type to represent an single instruction, and a sub-type to represent operands.

Take special care with the store operation: store r1 => r2. This operation takes the value in r1, and place it into the memory address specified by r2. Consequently, the value in r2 is read, not overwritten like every other destination operand. You SHOULD treat r2 as a source operand in your intermediate representation. You MUST print the store instruction correctly.

Project Details

Interface

Your code MUST compile on the janus machines, using a Makefile in the project directory, to an executable named tipc in that directory. Your program MUST implement the following command line arguments when specified by the short flags. You SHOULD support the long flags as well. You SHOULD use a library, such as getopt, to handle these flags. Your default output for this version SHOULD be the -t flag:.

-l --lexer Print a list of the tokens, one per line. You should print both the token type and the specific lexeme.
-p --pretty-print Print legal ILOC code. You do not have to preserve white-space or comments.
-t --table-print Print the intermediate representation in tabular form.
-h --help Print a help summary and exit.
-d --debug Print debugging information during execution. All debugging information SHOULD be printed as a comment.

Unless the -h flag is used, the final argument MUST be filename, an input file containing an ILOC block. For the tabular output, you SHOULD format the output into well-spaced columns, using separators such as | between elements of your IR. All outputs MUST be printed to stdout. Error messages for invalid arguments or an invalid file SHOULD be reasonable and MUST be printed to stderr. You MAY forbid the combination of -l with other print flags.

You may include other command-line arguments or flags for your program, but SHOULD leave the following flags free: -a, -k, and -s.

Infrastructure

The testing repository has a directory blocks that contains a a number of sample blocks. Each block has a comment with a sample input and expected output when simulated (see below). It also contains a folder p1/outputs/ that contains sample outputs for block01.i.

On the janus machines, a few tools are available in the directory /users/sfogarty/tools_csci3334/p1

reader_ref: A reference reader you can compare your code with.

sim: A simulator. Running ./sim -h will display details, but the usual invocation is ./sim < file or ./tipc -p file | sim.
The -i nums flag lets you initialize memory. The first number is the starting location, the every number after that populates 4 bytes. Use this to verify your -p output.

Grading Rubric

The project is graded as follows:

35 points: handed-coded scanner.
15 points: correctness of -l output.
15 points: correctness of -t output.
20 points: -p output produces correct, runnable code.
5 points: default outputs match samples under diff -w.
10 points: completed all SHOULD requirements.

Appendix: State Diagrams

This appendix contains state diagrams that accept opcodes and registers. These may be of assistance when implementing your scanner.

Report abuse