In C1 you built a query engine to answer queries about UBC course sections. C2 extends your C1 solution by adding another type of input data (rooms) and extending the query language (adding result computation). The input data will now include data about the physical spaces where classes are held on campus. The C1 query language was fairly simple, as you could not construct queries that aggregate and compute values over query results. C2 expands the query engine to enable result computation (e.g., to figure out the average for a course, or the number of seats in a building).
As in C1, we will give the specification and you will need to design and build a software project that conforms to that specification. The specification is outlined below.
If there are changes to this document, they will be listed here.
This data was obtained from the UBC Buildings and Classrooms listing (although it was captured a few years ago). The data is provided as a zip file: inside the zip you will find index.htm, which specifies each building on campus. The links in index.htm point to files, also in the zip, containing details about each building and its rooms in HTML format.
The dataset file can be found here: rooms.zip. Once unzipped, the directory structure should look like this:
```
.
├── campus/
│   └── discover/
│       └── buildings-and-classrooms/
│           ├── AAC.htm
│           ├── ACEN.htm
│           ├── ACU.htm
│           └── ...
└── index.htm
```
As with C1, you will need to parse valid rooms source files into internal objects or other data structures. You must also write a copy of the data structure to disk, and you should be able to load these files to be queried if necessary. These files should be saved to the <PROJECT_DIR>/data directory as before. Make sure not to commit these parsed files to version control, as this may cause unpredictable test failures. In a valid use of addDataset, a rooms zip will be passed as content, along with the kind InsightDatasetKind.Rooms.
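For the disk copy, here is a minimal sketch using Node's built-in fs module; the one-file-per-dataset layout and JSON serialization are our assumptions, not requirements of the spec:

```typescript
import * as fs from "fs";
import * as path from "path";

// Assumed layout: one JSON file per dataset id under <PROJECT_DIR>/data.
const DATA_DIR = path.join(__dirname, "..", "data");

function persistDataset(id: string, rooms: object[]): void {
    fs.mkdirSync(DATA_DIR, { recursive: true });
    fs.writeFileSync(path.join(DATA_DIR, `${id}.json`), JSON.stringify(rooms));
}

function loadDataset(id: string): object[] {
    return JSON.parse(fs.readFileSync(path.join(DATA_DIR, `${id}.json`), "utf-8"));
}
```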
There is a provided package called parse5 that you should use to parse the HTML files into a more convenient-to-traverse JSON format (you should only need the parse method). Parse5 also has an online playground where you can visualize the structure of a Document, which is the output of a parsed HTML file. You must traverse this document in order to extract the buildings/rooms information.
There are many ways to structure an HTML file to display the same information. It is important that your parsing does not hard-code the shape of the HTML tree. Instead, focus on searching the document tree for nodes that match the required specification. For example, there can be many <table/> elements within the index.htm file, so the code should search for all <table/>s and find the one that satisfies the specification (i.e., has valid building/rooms data). Ultimately, if you find yourself looking up Document nodes by hardcoded positions (e.g., children[0].children[1].children[0].text), you'll want to change your approach!
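For illustration, a recursive search over the parsed tree might look like the sketch below; the loose Parse5Node shape and the findNodes helper are our own, not part of parse5's API:

```typescript
import * as parse5 from "parse5";

// Minimal node shape covering only the fields we traverse; parse5's real
// types are richer, but this keeps the sketch self-contained.
interface Parse5Node {
    nodeName: string;
    childNodes?: Parse5Node[];
}

// Collect every node with the given nodeName (e.g., "table"), wherever it
// appears in the tree, instead of relying on hardcoded child indices.
function findNodes(node: Parse5Node, nodeName: string, found: Parse5Node[] = []): Parse5Node[] {
    if (node.nodeName === nodeName) {
        found.push(node);
    }
    for (const child of node.childNodes ?? []) {
        findNodes(child, nodeName, found);
    }
    return found;
}

// Usage: parse the file, gather all tables, then keep the one whose cells
// carry the expected class attributes (i.e., valid building/rooms data).
const document = parse5.parse("<html>...</html>") as unknown as Parse5Node;
const tables = findNodes(document, "table");
```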
A valid dataset:
Has to be a valid zip file
There is a single index.htm file per dataset in the root of the zip, and many building files in the campus/discover/buildings-and-classrooms directory.
Valid building files will always be in HTML format.
A valid room must contain every field which can be used by a rooms query (see Valid Query Keys).
Missing/meaningless values (e.g., an empty string) found in valid HTML elements are okay (as long as the element can be found).
If a building HTML file contains no rooms at all, it can be ignored (i.e., you don't need to keep building-level information).
If requesting a building's geolocation results in an error, skip over it.
A valid dataset has to contain at least one valid room that meets the requirements above.
Additionally, in valid zips the following will always be true:
You should only parse buildings that are linked to from the index.htm file. There may be more building files in the zip, but they should be ignored.
All <td/>s associated with a rooms field will have class attributes present in the same forms as in the provided rooms.zip file.
For example, if you see <td class="room-data">target</td> in rooms.zip, the <td/> with class "room-data" will always be present for target in any valid rooms dataset.
In addition to parsing the HTML files, you must transform each building's address into a latitude/longitude pair. This is usually performed using online web services, but to avoid everyone spamming different geolocation providers, we will provide a web service for you to use for this purpose. To obtain the geolocation of an address, you must send a GET request to:
http://cs310.students.cs.ubc.ca:11316/api/v1/project_team<TEAM NUMBER>/<ADDRESS>
Where <ADDRESS> should be the URL-encoded version of an address (e.g., 6245 Agronomy Road V6T 1Z4 should be represented as 6245%20Agronomy%20Road%20V6T%201Z4). Addresses must be given exactly as they appear in the data files, or a 404 will be returned.
To send these requests, you must use the http package. If you add any other third-party packages to your project, AutoTest will fail in unpredictable ways.
The response will match the following interface (either you will get lat & lon, or error, but never both):
interface GeoResponse {
    lat?: number;
    lon?: number;
    error?: string;
}
Since we are hosting this service, it could be taken down if overloaded (effectively a DoS attack), so please try not to overload it. You should only need to query this service while processing the initial dataset zips, not while answering queries.
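To sketch what a single request could look like with the http package (the fetchGeolocation helper, and the team number in the usage line, are placeholders of ours):

```typescript
import * as http from "http";

interface GeoResponse {
    lat?: number;
    lon?: number;
    error?: string;
}

// Hypothetical helper: resolve one address via the course web service.
// `teamNumber` stands in for your actual project team number.
function fetchGeolocation(teamNumber: string, address: string): Promise<GeoResponse> {
    const url = "http://cs310.students.cs.ubc.ca:11316/api/v1/project_team" +
        teamNumber + "/" + encodeURIComponent(address);
    return new Promise((resolve, reject) => {
        http.get(url, (res) => {
            let body = "";
            res.on("data", (chunk) => (body += chunk));
            res.on("end", () => {
                try {
                    resolve(JSON.parse(body) as GeoResponse);
                } catch (err) {
                    reject(err);
                }
            });
        }).on("error", reject);
    });
}

// e.g., fetchGeolocation("123", "6245 Agronomy Road V6T 1Z4").then((geo) => ...);
```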
Regarding the query engine, the primary objective of this checkpoint is two-fold:
(i) extend the query language to accommodate queries to a new dataset, i.e. Rooms; and
(ii) enable more comprehensive queries about the datasets, i.e. aggregate results.
At a high level, the new query functionalities added are:
GROUP: Group the list of results into sets by some matching criteria.
APPLY: Perform calculations across a set of results (i.e., across a GROUP).
MAX: Find the maximum value of a field. For numeric fields only.
MIN: Find the minimum value of a field. For numeric fields only.
AVG: Find the average value of a field. For numeric fields only.
SUM: Find the sum of a field. For numeric fields only.
COUNT: Count the number of unique occurrences of a field. For both numeric and string fields.
SORT: Order results on one or more columns.
You can sort by a single column as in C1, e.g., "ORDER": "sections_avg"; or
You can sort by multiple columns in either ascending or descending order by specifying these options in an object (see the example query below)
"dir": "UP": Sort results ascending.
"dir": "DOWN": Sort results descending.
"keys": ["sections_avg"]: sorts by a single key
"keys": ["sections_year", "sections_avg"]: sorts by multiple keys. In this example the course average should be used to resolve ties for courses in the same year
QUERY ::='{'BODY ', ' OPTIONS (', ' TRANSFORMATIONS)? '}'
BODY ::= 'WHERE:{' (FILTER)? '}'
OPTIONS ::= 'OPTIONS:{' COLUMNS (', ' SORT)? '}'
TRANSFORMATIONS ::= 'TRANSFORMATIONS: {' GROUP ', ' APPLY '}'
FILTER ::= LOGICCOMPARISON | MCOMPARISON | SCOMPARISON | NEGATION
LOGICCOMPARISON ::= LOGIC ':[{' FILTER ('}, {' FILTER )* '}]'
MCOMPARISON ::= MCOMPARATOR ':{' mkey ':' number '}'
SCOMPARISON ::= 'IS:{' skey ':' [*]? inputstring [*]? '}' // Asterisks should act as wildcards. Optional.
NEGATION ::= 'NOT :{' FILTER '}'
LOGIC ::= 'AND' | 'OR'
MCOMPARATOR ::= 'LT' | 'GT' | 'EQ'
COLUMNS ::= 'COLUMNS:[' ANYKEY (',' ANYKEY)* ']'
SORT ::= 'ORDER: ' ('{ dir:' DIRECTION ', keys: [ ' ANYKEY (',' ANYKEY)* ']}') | ANYKEY
DIRECTION ::= 'UP' | 'DOWN'
ANYKEY ::= key | applykey
GROUP ::= 'GROUP: [' (key ',')* key ']'
APPLY ::= 'APPLY: [' (APPLYRULE (', ' APPLYRULE )* )? ']'
APPLYRULE ::= '{' applykey ': {' APPLYTOKEN ':' key '}}'
APPLYTOKEN ::= 'MAX' | 'MIN' | 'AVG' | 'COUNT' | 'SUM'
key ::= mkey | skey
mkey ::= idstring '_' mfield
skey ::= idstring '_' sfield
mfield ::= 'avg' | 'pass' | 'fail' | 'audit' | 'year' | 'lat' | 'lon' | 'seats'
sfield ::= 'dept' | 'id' | 'instructor' | 'title' | 'uuid' | 'fullname' | 'shortname' | 'number' | 'name' | 'address' | 'type' | 'furniture' | 'href'
idstring ::= [^_]+ // One or more of any character, except underscore.
inputstring ::= [^*]* // zero or more of any character except asterisk.
applykey ::= [^_]+ // one or more of any character except underscore.
Syntactic Checking (Parsing)
Similar to Checkpoint 1, you must ensure that a query conforms to the above grammar, and reject it if it does not.
Semantic Checking
On top of the semantic checking from Checkpoint 1, you must perform the following semantic check:
The applykey in an APPLYRULE should be unique (no two APPLYRULEs should share an applykey with the same name).
If a GROUP is present, all COLUMNS keys must correspond to one of the GROUP keys or to applykeys defined in the APPLY block.
All SORT keys must also be in the COLUMNS.
MAX/MIN/AVG/SUM should only be requested for numeric keys. COUNT can be requested for all keys.
A valid query will not contain keys from more than one dataset kind (i.e. only sections_xx keys or only rooms_xx keys, never a combination).
If any of these conditions is not met, the query is invalid and must be rejected.
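For illustration, two of these checks might be implemented as in the sketch below; the query shapes and plain Error are assumptions (your implementation would reject with whatever error type your API uses):

```typescript
// Loose shape: each APPLYRULE is an object with exactly one applykey.
type ApplyRule = Record<string, Record<string, string>>;

function checkTransformationSemantics(columns: string[], group: string[], apply: ApplyRule[]): void {
    const applyKeys = apply.map((rule) => Object.keys(rule)[0]);

    // No two APPLYRULEs may share an applykey.
    if (new Set(applyKeys).size !== applyKeys.length) {
        throw new Error("Duplicate applykey in APPLY");
    }

    // With GROUP present, every COLUMNS key must be a GROUP key or an applykey.
    for (const col of columns) {
        if (!group.includes(col) && !applyKeys.includes(col)) {
            throw new Error(`COLUMNS key '${col}' is not in GROUP or APPLY`);
        }
    }
}
```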
Other JS/TypeScript requirements
Sorting should be according to the < operator in TypeScript/JavaScript, not by localeCompare.
AVG should return a number rounded to two decimal places. Supporting AVG requires some extra care compared to the other operators: since JavaScript numbers are represented as floating point, this arithmetic can return different values depending on the order in which the operations take place. To account for this, you must use the Decimal package (already included in your package.json) and follow these steps exactly:
Convert each value to Decimal (e.g., new Decimal(num)).
Add the numbers being averaged using Decimal's .add() method (e.g., building up a variable called total).
Calculate the average (let avg = total.toNumber() / numRows). numRows should not be converted to Decimal.
Round the average to the second decimal digit with toFixed(2) and cast the result back to number type (let res = Number(avg.toFixed(2)))[1].
SUM should return a number rounded to two decimal places using Number(sum.toFixed(2))[1].
COUNT should return whole numbers.
MIN/MAX should return the same number that is in the originating dataset.
[1] Once cast to a number, you may appear to "lose" decimal places; for instance, Number("2.00") will display as 2. This is okay.
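Put together, the four AVG steps (and the analogous SUM rounding) might look like this sketch, assuming decimal.js's default export and that values holds one group's numeric field values:

```typescript
import Decimal from "decimal.js";

function computeAvg(values: number[]): number {
    let total = new Decimal(0);
    for (const value of values) {
        total = total.add(new Decimal(value)); // steps 1 and 2
    }
    const avg = total.toNumber() / values.length; // step 3: numRows stays a plain number
    return Number(avg.toFixed(2)); // step 4
}

function computeSum(values: number[]): number {
    const total = values.reduce((acc, v) => acc.add(new Decimal(v)), new Decimal(0));
    return Number(total.toFixed(2)); // rounded to two decimal places
}
```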
In addition to the valid sections keys from Checkpoint 1, this checkpoint adds a variety of new rooms keys.
If the dataset id sent by the user is "rooms", then the queries you will run against this dataset must use the following keys:
rooms_fullname: string; Full building name (e.g., "Hugh Dempster Pavilion").
rooms_shortname: string; Short building name (e.g., "DMP").
rooms_number: string; The room number. Not always a number, so represented as a string.
rooms_name: string; The room id; should be rooms_shortname+"_"+rooms_number.
rooms_address: string; The building address. (e.g., "6245 Agronomy Road V6T 1Z4").
rooms_lat: number; The latitude of the building, as received via HTTP request.
rooms_lon: number; The longitude of the building, as received via HTTP request.
rooms_seats: number; The number of seats in the room. The default value for this field is 0.
rooms_type: string; The room type (e.g., "Small Group").
rooms_furniture: string; The room furniture (e.g., "Classroom-Movable Tables & Chairs").
rooms_href: string; The link to full details online (e.g., "http://students.ubc.ca/campus/discover/buildings-and-classrooms/room/DMP-201").
First, note that WHERE is completely independent of GROUP/APPLY. WHERE filtering happens first, then GROUP/APPLY are performed on those filtered results.
GROUP: [term1, term2, ...] signifies that a group should be created for every unique set of all N-terms. For example, GROUP: [sections_dept, sections_id] would create a group for every unique (department, id) pair in the sections dataset. Every member of a group will always have the same values for each key in the GROUP array (e.g. in the previous example, all members of a group would share the same values for sections_dept and sections_id).
As an example, suppose we have the following sections dataset (for the sake of simplicity, some keys are omitted):
[
{ "sections_uuid": "1", "sections_instructor": "Jean", "sections_avg": 90, "sections_title" : "310"},
{ "sections_uuid": "2", "sections_instructor": "Jean", "sections_avg": 80, "sections_title" : "310"},
{ "sections_uuid": "3", "sections_instructor": "Casey", "sections_avg": 95, "sections_title" : "310"},
{ "sections_uuid": "4", "sections_instructor": "Casey", "sections_avg": 85, "sections_title" : "310"},
{ "sections_uuid": "5", "sections_instructor": "Kelly", "sections_avg": 74, "sections_title" : "210"},
{ "sections_uuid": "6", "sections_instructor": "Kelly", "sections_avg": 78, "sections_title" : "210"},
{ "sections_uuid": "7", "sections_instructor": "Kelly", "sections_avg": 72, "sections_title" : "210"},
{ "sections_uuid": "8", "sections_instructor": "Eli", "sections_avg": 85, "sections_title" : "210"}
]
We want to query the dataset to aggregate sections by their title and obtain their average. Our aggregation query would look like this:
{
    "WHERE": {},
    "OPTIONS": {
        "COLUMNS": ["sections_title", "overallAvg"]
    },
    "TRANSFORMATIONS": {
        "GROUP": ["sections_title"],
        "APPLY": [{
            "overallAvg": {
                "AVG": "sections_avg"
            }
        }]
    }
}
For this query, there are two groups: one that matches "sections_title" = "310" and another that matches "210". At some point you will likely need an intermediate data structure to create and hold your groups; use whatever structure feels natural to you. Continuing with our example, we have these groups:
310 group =
[
{ "sections_uuid": "1", "sections_instructor": "Jean", "sections_avg": 90, "sections_title" : "310"},
{ "sections_uuid": "2", "sections_instructor": "Jean", "sections_avg": 80, "sections_title" : "310"},
{ "sections_uuid": "3", "sections_instructor": "Casey", "sections_avg": 95, "sections_title" : "310"},
{ "sections_uuid": "4", "sections_instructor": "Casey", "sections_avg": 85, "sections_title" : "310"}
]
210 group =
[
{ "sections_uuid": "5", "sections_instructor": "Kelly", "sections_avg": 74, "sections_title" : "210"},
{ "sections_uuid": "6", "sections_instructor": "Kelly", "sections_avg": 78, "sections_title" : "210"},
{ "sections_uuid": "7", "sections_instructor": "Kelly", "sections_avg": 72, "sections_title" : "210"},
{ "sections_uuid": "8", "sections_instructor": "Eli", "sections_avg": 85, "sections_title" : "210"}
]
The last step is fairly simple: we execute the APPLY operation on each group. The average of the "310" group is (90 + 80 + 95 + 85)/4 = 87.5, whereas for the "210" group the average is (74 + 78 + 72 + 85)/4 = 77.25. Our final result for the previous query would be:
[
{ "sections_title" : "310", "overallAvg": 87.5},
{ "sections_title" : "210", "overallAvg": 77.25}
]
Notice that we can have more elaborate groups, such as discovering whether a specific instructor of a section has a better average than other instructors (i.e., "GROUP": ["sections_instructor", "sections_title"]). In that case, we would have four groups: (310, Jean), (310, Casey), (210, Kelly), and (210, Eli).
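One possible shape for that intermediate grouping structure is a Map keyed by the group's values; this is a sketch under our own assumptions, not a required approach:

```typescript
type Row = Record<string, string | number>;

// Rows with identical values for all GROUP keys produce the same Map key,
// so they land in the same bucket.
function groupBy(rows: Row[], groupKeys: string[]): Map<string, Row[]> {
    const groups = new Map<string, Row[]>();
    for (const row of rows) {
        const key = JSON.stringify(groupKeys.map((k) => row[k]));
        const bucket = groups.get(key);
        if (bucket === undefined) {
            groups.set(key, [row]);
        } else {
            bucket.push(row);
        }
    }
    return groups;
}
```

The following example query puts these features together on a rooms dataset, grouping rooms by building and computing the maximum number of seats per group: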
{
    "WHERE": {
        "AND": [{
            "IS": {
                "rooms_furniture": "*Tables*"
            }
        }, {
            "GT": {
                "rooms_seats": 300
            }
        }]
    },
    "OPTIONS": {
        "COLUMNS": [
            "rooms_shortname",
            "maxSeats"
        ],
        "ORDER": {
            "dir": "DOWN",
            "keys": ["maxSeats"]
        }
    },
    "TRANSFORMATIONS": {
        "GROUP": ["rooms_shortname"],
        "APPLY": [{
            "maxSeats": {
                "MAX": "rooms_seats"
            }
        }]
    }
}
Response:
{
    "result": [{
        "rooms_shortname": "OSBO",
        "maxSeats": 442
    }, {
        "rooms_shortname": "HEBB",
        "maxSeats": 375
    }, {
        "rooms_shortname": "LSC",
        "maxSeats": 350
    }]
}
There are no changes to the API for this checkpoint; it is the same as in Checkpoint 1.
There are no changes to the testing instructions either; they are the same as in Checkpoint 1. Ultimately, your own local tests are what exercise the additional features. We highly recommend following the practices from C0/C1: read the spec, write tests, and implement!
For C2, request smoke test feedback with @310-bot #c2 on the main branch. @310-bot #c0 now also supports C2 specific tests. @310-bot #c1 continues to work for C1 smoke tests.
Moving forward with the project, we will be using an updated set of ESLint rules.
310-bot will create a pull request in your repo during the first week of this sprint. It will contain some changes to your project regarding linting. Look carefully at this pull request and merge it once you understand what 310-bot wants to merge into your project.
Many JavaScript libraries use the callback pattern to manage asynchronous operations. A program of any complexity will most likely need to manage several asynchronous operations at various levels of concurrency. A common pitfall that is easy to fall into is nesting callbacks, which makes code more difficult to read the deeper the callbacks are nested.
This rule enforces a maximum depth that callbacks can be nested to increase code clarity.
Rationale: The nested callback anti-pattern exposed by creating Promises inside Promises hampers both readability and maintainability. In addition, nested callbacks obscure the traceability of a program's control flow. This rule encourages better stylistic practice like Promise chaining, and frees up horizontal screen real estate by decreasing indentation.
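For instance (with hypothetical helpers loadZip, parseBuildings, and saveToDisk), the rule flags the nested form and encourages the chained one:

```typescript
declare function loadZip(content: string): Promise<Buffer>;
declare function parseBuildings(zip: Buffer): Promise<object[]>;
declare function saveToDisk(rooms: object[]): Promise<void>;

// Nested: each Promise is created inside the previous handler, pushing the
// code further right with every step and obscuring the control flow.
function addRoomsNested(content: string): Promise<void> {
    return loadZip(content).then((zip) => {
        return parseBuildings(zip).then((rooms) => {
            return saveToDisk(rooms);
        });
    });
}

// Chained: each step returns its Promise to the next .then(), keeping the
// flow flat and traceable.
function addRoomsChained(content: string): Promise<void> {
    return loadZip(content)
        .then((zip) => parseBuildings(zip))
        .then((rooms) => saveToDisk(rooms));
}
```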
A line of code containing too many statements can be difficult to read. Code is generally read from the top down, especially when scanning, so limiting the number of statements allowed on a single line can be very beneficial for readability and maintainability.
This rule enforces a maximum number of statements allowed per line.
Rationale: This lint rule discourages cramming too much distinct behaviour onto a single line, or taking the "quick fix" approach to a line-length lint rule by shortening method names so statements can be compacted onto one line. Descriptive names and readable lines are important and shouldn't be sacrificed for compactness.
There is no best way to get started, but you can consider each of these in turn. Some possible options that could be pursued in any order (or skipped entirely):
Start by looking at the data file we have provided and understanding what kind of data you will be analyzing and manipulating. It is crucial to understand that index.htm and the other files have different structures; you will need to extract different, though complementary, information from each of them.
Ignoring the rest of the dataset parsing, consider writing a method to get a building's address, along with tests for this helper method.
Ignoring the provided data, create a fake dataset with few entries. Write the portion of the system that would perform the GROUP and APPLY operations on this small dataset.
Trying to keep all of the requirements in mind at once can be overwhelming. Tackling a single task that you can accomplish in an hour is going to be much more effective than worrying about the whole specification at once. Iteratively growing your project from small task to small task is going to be the best way to make forward progress.
The #c2 grading rubric is given on the project grading page.