Able to refactor existing code to handle a new specification.
Can understand the basic structure of an HTML document, and how to traverse it programmatically.
Can make API calls to external services using HTTP.
Oct 26: Bot updated to detect references from src/ to test/, which can cause unexpected behaviours when private tests are running. This usually only affects TestUtil (specifically the reference to the cache dir). Just copy this field to src/ and you should be fine.
In the last checkpoint, you built a Data Processor to manage datasets and a Query Engine to handle queries on those datasets. In this checkpoint, you will extend both. The Data Processor will need to accept another type of input data, rooms, in the form of HTML files. The input data will include information about the physical spaces where classes are held on campus.
The Query Engine will be extended to support aggregation and additional ways of ordering results. For example, the new query language will be able to answer questions like "What is the average of this course?" by averaging all of its section averages, or "How many seats are in this building?" by summing the seats in every room in the building.
We will check for copied code on all submissions, so please make sure your work is your own.
The grade for this checkpoint will be calculated identically to Checkpoint 1.
Your grade for this checkpoint will be determined by running a private Client Test Suite against your implementation and will be calculated as:
(number of passing tests) / (total number of tests)
For example, if our test suite has 10 tests and when we run them against your implementation 8 pass, your grade is 80% (8/10).
Like C1, the Teamwork score comprises scrum attendance and weekly reports. The teamwork score is a part of your overall project grade. More details can be found on the Project Grading page.
You cannot use any library package that is not already specified in this document or required for TypeScript compilation (i.e., types).
Your implementation must be in TypeScript.
You are not allowed to store the data in any external database, only disk storage is permitted.
Do not store your datasets as static or global variables; keep them as members of a class. The reason for this is that we try to wipe added datasets in between tests. This won't work if your data is stored statically, so one test behaving incorrectly may cause future tests to fail as well, lowering your grade.
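For illustration, a minimal sketch of the intended shape (the class body and field names here are ours, not prescribed by the spec):

```typescript
// Sketch only: datasets kept as an instance member, not a static field.
export default class InsightFacade {
    // OK: each InsightFacade instance owns its datasets, so they can be wiped between tests.
    private datasets = new Map<string, unknown>();

    // Avoid: a static field survives across instances and test runs.
    // private static datasets = new Map<string, unknown>();
}
```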
As with C1, we will grade every commit on the main branch of your repository. Your grade will be the maximum grade you received from all commits on the main branch made before the checkpoint deadline.
This section is identical to the prior checkpoint except for the new test clusters relevant to the updated specification. The end of this section includes a screenshot of the new clusters.
Please refer to Checkpoint 1 for details on the types of AutoTest feedback. In summary:
You can request the Smoke Test results only on the main branch. Request the feedback by creating a comment containing @310-bot #c2 on a commit. You can receive this feedback only once every 12 hours per person.
AutoTest will report any build, lint, or timeout failures automatically on your feature branches.
The #check command can be used to evaluate your test suite. You can only receive this feedback once every 6 hours per person.
@310-bot #c1 will continue to work for C1 smoke tests.
Updated Smoke Test Clusters.
You will extend InsightUBC to provide a way for users to manage their room data (add, list and remove room datasets) and to query the room data for insights. Also, you will extend the query language to handle aggregation computations and new sorting options; these will provide additional insights about both rooms and courses.
All of the changes in this checkpoint are to the addDataset and performQuery endpoints and are described below. The other endpoints (listDatasets and removeDataset) are unchanged from the prior Checkpoint.
The API remains unchanged from the insightUBC Section Specification.
Very important: do not alter the given API (IInsightFacade.ts interface) in any way, as it is used to grade your project!
The same addDataset method defined in the IInsightFacade.ts interface file is used to add a rooms kind of dataset.
The definition of a valid ID has not changed. Refer to the insightUBC Section Specification.
Same as for sections, the content parameter is a zip file in the form of a base64 string. All the data you need is contained within this zip file. You should use the JSZip module to unzip, navigate through, and view the files inside it.
However, unlike sections, the rooms data is contained within HTML files (.htm), not JSON (.json). Also, unlike the sections dataset, the information for a single room is spread between two files: the index.htm file (which contains a room's building information) and a building HTML file like AAC.htm (which contains the room information). A single building can have multiple rooms.
At the root of the zip file is the index.htm file. The index.htm file contains a table with building information, where one column of that table contains a link (file path) to each building's file. Below is an example of the file structure of a rooms dataset where the building files are found within campus/discover/buildings-and-classrooms/. This is the same file structure as the given room database: campus.zip.
.
├── campus/
│   └── discover/
│       └── buildings-and-classrooms/
│           ├── AAC.htm
│           ├── ACEN.htm
│           ├── ACU.htm
│           └── ...
└── index.htm
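As an illustration, a minimal sketch of loading the zip and reading index.htm with JSZip (real error handling, e.g., rejecting with InsightError, is omitted; the function name is ours):

```typescript
import JSZip from "jszip";

// Sketch: content is the base64 string passed to addDataset.
async function readIndexFile(content: string): Promise<string> {
    const zip = await JSZip.loadAsync(content, { base64: true });
    const index = zip.file("index.htm"); // index.htm sits at the root of the zip
    if (index === null) {
        throw new Error("index.htm not found at the root of the zip");
    }
    return await index.async("text"); // the raw HTML of index.htm
}
```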
A valid dataset:
Is a zip file.
Contains at least one valid room.
A valid index.htm file:
Is an HTML-formatted file. If index.htm exists, it is safe to assume that it will always be a well-formatted HTML file.
Contains a table that lists and links to building data files.
The index.htm file can contain many tables, but only one will be the valid building list table. How to find the correct table is explained later in this section. The index.htm might also contain no table, in which case it would be invalid.
Each row in the table represents a building and the row will contain a column that links to the building's data file within the zip. The building file might not exist or it could contain no valid rooms.
An example link to a building file (ALRD.htm) looks like this:
<a href="./campus/discover/buildings-and-classrooms/ALRD.htm" title="Building Details and Map">...</a>
All building file links will be anchor elements (<a>) with the file path in their href attribute.
A valid building file:
Is an HTML-formatted file. If the building file exists, it is safe to assume that it will always be a well-formatted HTML file.
Is linked from the index.htm file (as explained above).
Contains a table with valid rooms.
The building file can contain many tables, but only one will be the valid room table. How to find the correct table is explained later in this section.
The building file might contain no rooms table or may contain a table with no valid rooms, in which case that building has no rooms.
A valid room:
Contains every field which can be used in a rooms query (see the Valid Query Keys section).
Note: if a field is present in the HTML (i.e., the <td> cell exists) but contains something counter-intuitive, such as an empty string, it is still valid.
The building's geolocation request returns successfully (i.e., it does not return an error). Geolocation is described below.
All the room data is contained within HTML tables: a building table within the index.htm file, and a room table within each building's HTML file. An HTML table (<table> element) contains rows (<tr> elements), and rows contain cells (<td> elements). HTML elements can have classes. Below is an example of an HTML table, which is a simplified version of the table found within the index.htm file:
<table>
    <thead>...</thead>
    <tbody>
        <tr>
            <td class="views-field views-field-title">
                <a href="./discover/buildings-and-classrooms/ACU.htm" title="Building Details and Map">Acute Care Unit</a>
            </td>
            <td class="views-field views-field-field-building-address">
                2211 Wesbrook Mall
            </td>
            ...
        </tr>
    </tbody>
</table>
In the above example, the first cell element (<td>) has two CSS classes (views-field and views-field-title). The second cell has the same class as the first (views-field) and a second, unique class (views-field-field-building-address).
The classes found on table cells (<td>) will be the same across all valid tables. For example, all valid index.htm tables will have the views-field and views-field-field-building-address classes on their address cells.
To find the table containing the room information within an HTML file, you will need to look at the classes on the <td> elements. As soon as you find one <td> element with a valid class, you have found the room data table. Once the room data table has been found, it will need to be validated to ensure it contains all the required information.
To find valid classes, unzip the given room dataset, campus.zip, open the .htm files and look at the classes on the <td> elements.
As seen in the above example, the classes views-field and views-field-field-building-address can be used to find a room's building address.
An example valid rooms kind dataset is the UBC Building and classrooms listing from a few years ago: campus.zip. To find the number of valid rooms inside the campus.zip, you will need to query it using the Reference UI (construct a query with no filter!).
For a building that contains a valid room, you will need to fetch the building's latitude and longitude.
This is usually performed using online web services. To avoid spamming external geolocation providers, we will be providing a web service for you to use for this purpose. To obtain the geolocation of an address, you must send a GET request to:
http://cs310.students.cs.ubc.ca:11316/api/v1/project_team<TEAM NUMBER>/<ADDRESS>
Where <ADDRESS> should be the URL-encoded version of an address (e.g., "6245 Agronomy Road V6T 1Z4" should be represented as 6245%20Agronomy%20Road%20V6T%201Z4). Addresses should be given exactly as they appear in the dataset files, or an HTTP 404 error code will be returned.
The response will match the following interface (either you will get lat & lon, or error, but never both):
interface GeoResponse {
    lat?: number;
    lon?: number;
    error?: string;
}
Since we are hosting this service, it could be killed by an accidental DoS attack, so please try not to overload it. You should only need to query this service when you are processing the initial dataset zips, not when you are answering queries.
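For example, here is a hedged sketch of one way to call the service with Node's built-in http module (the function name is ours, and <TEAM NUMBER> is a placeholder you must fill in):

```typescript
import * as http from "http";

// Sketch: resolve with the parsed GeoResponse (interface above) for one address.
function fetchGeolocation(address: string): Promise<GeoResponse> {
    const url = "http://cs310.students.cs.ubc.ca:11316/api/v1/project_team<TEAM NUMBER>/"
        + encodeURIComponent(address);
    return new Promise((resolve, reject) => {
        http.get(url, (res) => {
            let raw = "";
            res.on("data", (chunk) => (raw += chunk));
            res.on("end", () => {
                try {
                    resolve(JSON.parse(raw) as GeoResponse);
                } catch (err) {
                    reject(err);
                }
            });
        }).on("error", reject);
    });
}
```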
Valid Kind argument to addDataset
When adding a rooms kind dataset, the dataset kind will be InsightFacade.Rooms. The InsightFacade.Sections kind is also valid but only when adding a sections dataset.
Regarding the Query Engine, the primary objective of this checkpoint is two-fold:
Extend the query language to accommodate queries to a new dataset kind, i.e., Rooms; and
Enable more comprehensive queries about the datasets, i.e., aggregate results, directional sorts.
A valid query is the same as before:
Is based on the given EBNF (defined below)
Only references one dataset (via the query keys).
Has at most 5000 results. If this limit is exceeded, the query should reject with a ResultTooLargeError.
At a high level, the new query functionalities added are:
GROUP: Group the list of results into sets using some matching criteria.
APPLY: Perform calculations across a set of results (i.e., across a GROUP).
SORT: Order results by one or more columns.
QUERY ::='{' BODY ', ' OPTIONS '}' | '{' BODY ', ' OPTIONS ', ' TRANSFORMATIONS '}'
// Note: a BODY with no FILTER (i.e., WHERE:{}) matches all entries.
BODY ::= 'WHERE:{' FILTER? '}'
FILTER ::= LOGICCOMPARISON | MCOMPARISON | SCOMPARISON | NEGATION
LOGICCOMPARISON ::= LOGIC ':[' FILTER_LIST ']'
MCOMPARISON ::= MCOMPARATOR ':{' mkey ':' number '}'
SCOMPARISON ::= 'IS:{' skey ': "' [*]? inputstring [*]? '" }' // Asterisks at the beginning or end of the inputstring should act as wildcards.
NEGATION ::= 'NOT :{' FILTER '}'
FILTER_LIST ::= '{' FILTER '}' | '{' FILTER '}, ' FILTER_LIST // Comma separated list of filters containing at least one filter
LOGIC ::= 'AND' | 'OR'
MCOMPARATOR ::= 'LT' | 'GT' | 'EQ'
OPTIONS ::= 'OPTIONS:{' COLUMNS '}' | 'OPTIONS:{' COLUMNS ', ' SORT '}'
SORT ::= 'ORDER: { dir:' DIRECTION ', keys: [ ' ANYKEY_LIST '] }' | 'ORDER: ' ANYKEY
DIRECTION ::= 'UP' | 'DOWN'
TRANSFORMATIONS ::= 'TRANSFORMATIONS: {' GROUP ', ' APPLY '}'
GROUP ::= 'GROUP: [' KEY_LIST ']'
APPLY ::= 'APPLY: [' APPLYRULE_LIST? ']'
APPLYRULE_LIST ::= APPLYRULE | APPLYRULE ', ' APPLYRULE_LIST
APPLYRULE ::= '{' applykey ': {' APPLYTOKEN ':' KEY '} }'
APPLYTOKEN ::= 'MAX' | 'MIN' | 'AVG' | 'COUNT' | 'SUM'
COLUMNS ::= 'COLUMNS:[' ANYKEY_LIST ']'
// Comma-separated list of keys containing at least one key
KEY_LIST ::= KEY | KEY ', ' KEY_LIST
ANYKEY_LIST ::= ANYKEY | ANYKEY ', ' ANYKEY_LIST
ANYKEY ::= KEY | applykey
KEY ::= mkey | skey
mkey ::= '"' idstring '_' mfield '"'
skey ::= '"' idstring '_' sfield '"'
mfield ::= 'avg' | 'pass' | 'fail' | 'audit' | 'year' | 'lat' | 'lon' | 'seats'
sfield ::= 'dept' | 'id' | 'instructor' | 'title' | 'uuid' | 'fullname' | 'shortname' | 'number' | 'name' | 'address' | 'type' | 'furniture' | 'href'
idstring ::= [^_]+ // One or more of any character, except underscore.
inputstring ::= [^*]* // Zero or more of any character, except asterisk.
applykey ::= [^_]+ // One or more of any character, except underscore.
The query language now supports performing calculations across a group of results.
The types of calculations supported are:
MAX: Find the maximum value of a field.
Returns the same number that is in the originating dataset.
MIN: Find the minimum value of a field.
Returns the same number that is in the originating dataset.
AVG: Find the average value of a field.
Returns a number rounded to two decimal places.
SUM: Find the sum of a field.
Returns a number rounded to two decimal places.
COUNT: Count the number of unique occurrences of a field.
Returns whole numbers.
Requirements:
MAX/MIN/AVG/SUM should only be requested for numeric keys. COUNT can be requested for all keys.
The applykey in an APPLYRULE must be unique; no two APPLYRULEs may share the same applykey.
If GROUP is present, all COLUMNS keys must correspond to one of the GROUP keys or to applykeys defined in the APPLY block.
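To make the token semantics concrete, here is a rough sketch of evaluating MAX, MIN, and COUNT over one group's values (AVG and SUM need the Decimal-based steps described later in this document; the function name is ours):

```typescript
// Sketch: values are the entries of one GROUP for the key named in the APPLYRULE.
function applyToken(token: string, values: Array<number | string>): number {
    switch (token) {
        case "MAX":
            return Math.max(...(values as number[])); // numeric keys only
        case "MIN":
            return Math.min(...(values as number[])); // numeric keys only
        case "COUNT":
            return new Set(values).size; // unique occurrences; works for any key
        default:
            // AVG and SUM require the Decimal-based steps described later.
            throw new Error(`Unsupported token: ${token}`);
    }
}
```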
The query language now supports sorting by:
A single column as in C1, e.g. "ORDER": "sections_avg"
An object specifying the sort direction (ascending or descending) and one or more columns to sort by
e.g., "ORDER": {"dir": "DOWN", "keys": ["maxSeats"]}
"dir"
The sort order is set by the direction ("dir"):
"dir": "UP": Sort results ascending.
"dir": "DOWN": Sort results descending.
"keys"
The "keys" field allows for sorting by multiple keys (i.e., columns), where each additional key resolves ties for the previous key.
For example:
"keys": ["sections_avg"]: sorts by a single key
"keys": ["sections_year", "sections_avg"]: sorts by multiple keys. In this case, the section average should be used to resolve ties for sections in the same year
Requirements:
All SORT keys must also be in the COLUMNS.
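A sketch of a directional, multi-key comparator under these rules (it assumes each result row is a plain object keyed by column name; the names are ours):

```typescript
// Values are section/room field values (string or number).
type Row = Record<string, any>;

// Sketch: compare rows on the first key; each later key breaks ties on the previous one.
function makeComparator(dir: "UP" | "DOWN", keys: string[]): (a: Row, b: Row) => number {
    const sign = dir === "DOWN" ? -1 : 1;
    return (a, b) => {
        for (const key of keys) {
            if (a[key] < b[key]) {
                return -sign;
            }
            if (a[key] > b[key]) {
                return sign;
            }
        }
        return 0; // tied on all keys
    };
}

// Usage sketch: results.sort(makeComparator("DOWN", ["maxSeats"]));
```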
Valid query keys follow the same format as specified in the Sections Specification.
First, note that WHERE is completely independent of GROUP/APPLY. WHERE filtering happens first, then GROUP/APPLY are performed on those filtered results.
GROUP: [term1, term2, ...] signifies that a group should be created for every unique set of all N-terms. For example, GROUP: [sections_dept, sections_id] would create a group for every unique (department, id) pair in the sections dataset. Every member of a group will always have the same values for each key in the GROUP array (e.g., in the previous example, all members of a group would share the same values for sections_dept and sections_id).
As an example, suppose we have the following courses dataset (for the sake of simplicity, some keys are omitted):
[
{ "sections_uuid": "1", "sections_instructor": "Jean", "sections_avg": 90, "sections_title" : "310"},
{ "sections_uuid": "2", "sections_instructor": "Jean", "sections_avg": 80, "sections_title" : "310"},
{ "sections_uuid": "3", "sections_instructor": "Casey", "sections_avg": 95, "sections_title" : "310"},
{ "sections_uuid": "4", "sections_instructor": "Casey", "sections_avg": 85, "sections_title" : "310"},
{ "sections_uuid": "5", "sections_instructor": "Kelly", "sections_avg": 74, "sections_title" : "210"},
{ "sections_uuid": "6", "sections_instructor": "Kelly", "sections_avg": 78, "sections_title" : "210"},
{ "sections_uuid": "7", "sections_instructor": "Kelly", "sections_avg": 72, "sections_title" : "210"},
{ "sections_uuid": "8", "sections_instructor": "Eli", "sections_avg": 85, "sections_title" : "210"}
]
We want to query the above dataset to aggregate sections by their title and obtain their average. Our aggregation query would look like this:
{
    "WHERE": {},
    "OPTIONS": {
        "COLUMNS": ["sections_title", "overallAvg"]
    },
    "TRANSFORMATIONS": {
        "GROUP": ["sections_title"],
        "APPLY": [{
            "overallAvg": {
                "AVG": "sections_avg"
            }
        }]
    }
}
For this query, there are two groups: one that matches "sections_title" = "310" and another that matches "210". At some point you will likely need an intermediate data structure to create and hold your groups; use whatever structure feels natural to you.
Continuing with our example, we have these groups:
310 group = [
{ "sections_uuid": "1", "sections_instructor": "Jean", "sections_avg": 90, "sections_title" : "310"},
{ "sections_uuid": "2", "sections_instructor": "Jean", "sections_avg": 80, "sections_title" : "310"},
{ "sections_uuid": "3", "sections_instructor": "Casey", "sections_avg": 95, "sections_title" : "310"},
{ "sections_uuid": "4", "sections_instructor": "Casey", "sections_avg": 85, "sections_title" : "310"}
]
210 group = [
{ "sections_uuid": "5", "sections_instructor": "Kelly", "sections_avg": 74, "sections_title" : "210"},
{ "sections_uuid": "6", "sections_instructor": "Kelly", "sections_avg": 78, "sections_title" : "210"},
{ "sections_uuid": "7", "sections_instructor": "Kelly", "sections_avg": 72, "sections_title" : "210"},
{ "sections_uuid": "8", "sections_instructor": "Eli", "sections_avg": 85, "sections_title" : "210"}
]
The last step is fairly simple: we execute the apply operation on each group. The average of the "310" group is (90 + 80 + 95 + 85)/4 = 87.5, whereas for the "210" group the average is (74 + 78 + 72 + 85)/4 = 77.25. Our final result for the above query would be:
[
{ "sections_title" : "310", "overallAvg": 87.5},
{ "sections_title" : "210", "overallAvg": 77.25}
]
Notice that we can have more elaborate groups, such as discovering whether a specific instructor of a section has a better average than other instructors (i.e., "GROUP": ["sections_instructor", "sections_title"]). In that case, we would have four groups: (310, Jean), (310, Casey), (210, Kelly), and (210, Eli).
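One workable intermediate structure is a map from a composite group key to the rows in that group. A sketch (it assumes rows are plain objects; the key scheme and names are ours):

```typescript
type Row = Record<string, string | number>;

// Sketch: partition rows by their values for the GROUP keys.
function groupBy(rows: Row[], groupKeys: string[]): Map<string, Row[]> {
    const groups = new Map<string, Row[]>();
    for (const row of rows) {
        // e.g., for GROUP: ["sections_instructor", "sections_title"], a Jean/310
        // row gets the composite key "Jean|310". JSON.stringify of the values is
        // a safer alternative if values may contain the separator.
        const key = groupKeys.map((k) => String(row[k])).join("|");
        const group = groups.get(key);
        if (group === undefined) {
            groups.set(key, [row]);
        } else {
            group.push(row);
        }
    }
    return groups;
}
```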
Below is another example of a valid query and its results. The query looks for all rooms that contain tables and have more than 300 seats, groups the matching rooms by their building (shortname), and finds the maximum seat capacity within each building. The query returns the maximum capacity per building, in descending order.
{
    "WHERE": {
        "AND": [{
            "IS": {
                "rooms_furniture": "*Tables*"
            }
        }, {
            "GT": {
                "rooms_seats": 300
            }
        }]
    },
    "OPTIONS": {
        "COLUMNS": [
            "rooms_shortname",
            "maxSeats"
        ],
        "ORDER": {
            "dir": "DOWN",
            "keys": ["maxSeats"]
        }
    },
    "TRANSFORMATIONS": {
        "GROUP": ["rooms_shortname"],
        "APPLY": [{
            "maxSeats": {
                "MAX": "rooms_seats"
            }
        }]
    }
}
Response:
[
    {
        "rooms_shortname": "OSBO",
        "maxSeats": 442
    },
    {
        "rooms_shortname": "HEBB",
        "maxSeats": 375
    },
    {
        "rooms_shortname": "LSC",
        "maxSeats": 350
    }
]
This section is identical to the insightUBC Section Specification.
Just like sections, room datasets should be accessible after a crash, so the room datasets will need to be saved to disk in the <PROJECT_DIR>/data directory.
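For instance, a minimal sketch of persisting and reloading a processed dataset as JSON with Node's built-in fs/promises module (the file layout and function names are ours, not required by the spec):

```typescript
import * as fs from "fs/promises";

// Sketch: write one processed dataset to <PROJECT_DIR>/data so it survives a restart.
async function persistDataset(id: string, rows: unknown[]): Promise<void> {
    await fs.mkdir("data", { recursive: true });
    await fs.writeFile(`data/${id}.json`, JSON.stringify(rows));
}

async function loadDataset(id: string): Promise<unknown[]> {
    const raw = await fs.readFile(`data/${id}.json`, "utf-8");
    return JSON.parse(raw) as unknown[];
}
```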
The room dataset contains HTML files which will need to be parsed.
There is a provided package called parse5 that you should use to parse the HTML files into a more convenient-to-traverse JSON format (you should only need the parse method). Parse5 also has an online playground where you can visualize the structure of a Document, which is the output of a parsed HTML file. You must traverse this document in order to extract the buildings/rooms information.
There are many ways to structure an HTML file to display the same information, so it is important not to hard-code the traversal of the HTML tree. Instead, focus on searching the document tree for nodes that match the specification. For example, there can be many <table> elements in the index.htm file, so your code should search for all <table>s and find the one that satisfies the specification (i.e., has valid building/rooms data). Ultimately, if you find yourself looking up Document nodes by hardcoded positions (e.g., children[0].children[1].children[0].text), you'll want to change your approach!
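A sketch of this search-based approach with parse5 (node shapes follow parse5's default tree adapter; the helper names are ours, and the exact import form may vary by parse5 version):

```typescript
import { parse } from "parse5";

// Sketch: depth-first search for all nodes with a given tag name.
function findNodes(node: any, tagName: string, found: any[] = []): any[] {
    if (node.nodeName === tagName) {
        found.push(node);
    }
    for (const child of node.childNodes ?? []) {
        findNodes(child, tagName, found);
    }
    return found;
}

// Sketch: find the building table by searching every <table> for a <td>
// carrying one of the class names described in this specification.
function findBuildingTable(indexHtml: string): any | undefined {
    const document = parse(indexHtml);
    return findNodes(document, "table").find((table) =>
        findNodes(table, "td").some((td) =>
            (td.attrs ?? []).some(
                (attr: any) => attr.name === "class" && attr.value.includes("views-field")
            )
        )
    );
}
```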
Browser Development Tools
HTML is much harder to read than JSON. Every browser comes with development tools to view and interact with the HTML for the displayed page. A great way to familiarize yourself with the structure of the campus.zip is to open the index.htm file in your browser and inspect the HTML elements using the browser development tools. You can use the inspector to move through the HTML tree and click on the links to open up the building files.
Chrome Developer Tools: Open the index.htm file in Chrome, then open the developer tools to inspect elements. You can click links to open building files.
Sending the Request
To send these requests, you must use the http package.
Although the request is a GET, you cannot test the response by pasting the URL directly into your browser (like Chrome). The browser will automatically convert http to https, and the request will be rejected.
The best way to test the Geolocation locally is by using the curl command from your terminal. For example, you can use the following command, where google.com is replaced with your team's URL.
curl -i http://google.com
Encoding the Address
To encode the address, use the function encodeURIComponent() (documentation link).
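For example, matching the encoding shown earlier:

```typescript
const encoded = encodeURIComponent("6245 Agronomy Road V6T 1Z4");
// encoded === "6245%20Agronomy%20Road%20V6T%201Z4"
```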
TypeScript/JavaScript numbers are represented as floating-point numbers, so this arithmetic can return different values depending on the order in which the operations take place. Certain operations must therefore be handled with care.
Perform the following steps exactly when implementing the operations below:
AVG: Must use the Decimal package (already included in your package.json).
Convert each of your values to a Decimal:
e.g., new Decimal(num)
Add the numbers being averaged using Decimal's add() method (and building up a variable called total).
Calculate the average. numRows should not be converted to a Decimal:
e.g., let avg = total.toNumber() / numRows
Round the average to the second decimal digit with toFixed(2) and cast the result back to a number type. When casting to a number, you may appear to "lose" decimal places, for instance Number("2.00") will display as 2. This is okay.
e.g., let res = Number(avg.toFixed(2))
SUM: Use toFixed(2) to round to two decimal places.
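Putting the steps together, a sketch of AVG and SUM (assuming the Decimal package is decimal.js; the function names are ours):

```typescript
import Decimal from "decimal.js";

// Sketch: AVG following the steps above; values are the raw numbers in one group.
function computeAvg(values: number[]): number {
    let total = new Decimal(0);
    for (const value of values) {
        total = total.add(new Decimal(value)); // add() returns a new Decimal
    }
    const avg = total.toNumber() / values.length; // numRows stays a plain number
    return Number(avg.toFixed(2)); // round to two decimals, cast back to number
}

// Sketch: SUM only needs toFixed(2) on the final total.
function computeSum(values: number[]): number {
    const sum = values.reduce((acc, value) => acc + value, 0);
    return Number(sum.toFixed(2));
}
```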
Sorting should be according to the < operator in TypeScript/JavaScript, not by localeCompare.
localeCompare is significantly slower than the < operator and is very configurable, which can lead to performance issues and hard-to-diagnose differences between local tests in your development environment and AutoTest.
Like with C1, you will want to create your own zip files for testing. However, the rooms zip does not contain a root folder, so be careful when creating your zip file not to include one: the index.htm file should exist at the root of the zip.
310-bot will create a pull request against the main branch of your repository with changes to the linting rules. You will need to merge the PR to accept this change and update your repo. There are two new major lint rules that will be used and a minor stylistic change.
Many JavaScript libraries use the callback pattern to manage asynchronous operations. A program of any complexity will most likely need to manage several asynchronous operations at various levels of concurrency. A common pitfall that is easy to fall into is nesting callbacks, which makes code more difficult to read the deeper the callbacks are nested.
This rule enforces a maximum depth that callbacks can be nested to increase code clarity.
The nested-callback anti-pattern, exemplified by creating Promises inside Promises, hampers both readability and maintainability. In addition, nested callbacks obscure the traceability of a program's control flow. This rule encourages better stylistic practices like Promise chaining, and frees up horizontal screen real estate by decreasing indentation.
A line of code containing too many statements can be difficult to read. Code is generally read from the top down, especially when scanning, so limiting the number of statements allowed on a single line can benefit both readability and maintainability.
This rule enforces a maximum number of statements allowed per line.
This lint rule discourages cramming too much behaviour into a single line, or taking the "quick fix" approach to a line-length lint rule by shortening method names so that statements can be compacted onto one line. Descriptive names and readable lines are important and shouldn't be sacrificed for compactness.
lines-between-class-members:always
Enforces lines between class members.
There are several ways to get started. Some possible options that could be pursued, in any order:
Watch/read the recommended videos and tutorials. These really help the ideas of HTML parsing and working with async code sink in.
Start by looking at the rooms kind dataset we have provided and understanding what kind of data you will be analyzing and manipulating. It is crucial to understand that index.htm and the building files have different structures; you will need to extract different, though complementary, information from each of them. You can open the HTML files in your browser to inspect them and use the parse5 online playground to understand their structure.
Ignoring the rest of the dataset parsing, consider writing a method to get a building's geolocation along with tests for this helper method.
Ignoring the provided dataset, create a mock dataset with fewer rows. Write the portion of the system that would perform the GROUP and APPLY operations on this small dataset.
Trying to keep all of the requirements in mind at once can be overwhelming. Tackling a single task that you can accomplish in an hour is going to be much more effective than worrying about the whole specification at once. Iteratively growing your project from small task to small task is going to be the best way to make forward progress.
The following resources have been created by course staff to assist you with the project.
HTML Parsing tips: Reviews the structure of HTML and how to search for an HTML element.
Async Cookbook: Learn about promises and the differences between synchronous and asynchronous code.