Lab 0: Warmup

Since this is the first week, we're just doing a partial lab (still worth points) to get going. The lab is due September 7, 2021 via Gradescope.

Starter Code

Get the starter code on GitHub Classroom! This will require you to set things up with a GitHub account. If you're having issues with authentication, check out the GitHub help linked at the bottom of this page.

Biographical Survey (3 pts)

Please make sure to fill out the course biographical survey. I use it both to get a sense of people's interests and to learn whether there is anything you need in this course. While you'll get points on Gradescope for doing this, it's graded by me solely on completion; the grutors won't see the contents of this form.

Setting Up (5 pts)

There’s a Docker image available for this course. See the course Docker instructions for details about how to work on your own machine.

If you would prefer to use knuth, you can open up a tool that supports SSH (e.g. Terminal on a Mac, PuTTY or PowerShell on Windows) and run

ssh username@knuth.cs.hmc.edu

and then enter your password for the server. If you used the lab machines at Harvey Mudd in the past, this is the same account. If you either don't have such an account or have forgotten the password, let me know here.

To save yourself time in future weeks, I'd like you to make sure you can do the following two things on the command line, either by logging into knuth or by working in Docker (whichever you'd prefer to use for this class). Put the requested response for each piece in the appropriate part of analysis.md, which you should convert to a PDF and submit to Gradescope. To do this conversion, you can use a command called pandoc:

pandoc analysis.md -o analysis.pdf

If you need help getting pandoc running on your own machine, you can refer to the Docker instructions for a way to use Docker to do this. You'll get 1 point just for submitting the PDF.

  1. Navigate to the data directory /cs/cs159/data/gutenberg/. We can run the head command on a file in that directory to print out its first few lines, e.g.:

$ head carroll-alice.txt

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


Please pick a file in the directory other than carroll-alice.txt and copy the output of the head command into your analysis.md file. (2 pts)

  2. Run python3 to start the interpreter. In the interpreter, run the following commands and verify that they don't throw any errors:

>>> import discord # relevant for lab 1
>>> from spacy.lang.en import English # relevant for lab 3
>>> nlp_model = English() # this makes an English spaCy model

Please verify that both of these import without errors. If you get an import error, run the appropriate command below in the terminal to install the missing package. (Note: in general, we don't want you to have to install big packages this way on knuth, since you have a limited amount of space there, but these packages should be small!)

$ pip3 install <package name> --user # For any package
$ pip3 install discord.py --user # For discord

While we won't be using spaCy for a couple of weeks, we're going to do a quick test of one of the cool things spaCy can do: pull out the pieces of text it thinks are URLs! Try pasting in a snippet of text and seeing what output it gives. I tried this out using the text from the homepage:

>>> test_str = """
... [stuff I pasted in here]
... """

I then ran the following code to get a spaCy-processed version of my string split into pieces (we'll talk more about how it gets these pieces in Weeks 2 and 3). We can check which pieces it thinks are URLs using the like_url attribute of each piece:

>>> spacy_output = nlp_model(test_str)
>>> for piece in spacy_output:
...     if piece.like_url:
...         print(piece)
...
https://www.gradescope.com/courses/285728
https://web.stanford.edu/~jurafsky/slp3/

Paste the output you get into the corresponding section of analysis.md. Quickly comment: how well did the code match URLs as compared to your expectations? If there are examples of things you thought it would get right or wrong where it behaved differently, mention them here. (2 pts)
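
If you'd prefer to run this check as a script rather than line by line in the interpreter, here's a minimal sketch; the filename and sample text are just placeholders and are not part of the starter code:

# url_check.py -- a small standalone version of the like_url test above.
# (This file is NOT part of the starter code; it's just an illustration.)
from spacy.lang.en import English

nlp_model = English()  # a blank English pipeline: tokenization only

# Swap in whatever text you'd like to test here.
test_str = """
The course Gradescope page is https://www.gradescope.com/courses/285728 and
the textbook lives at https://web.stanford.edu/~jurafsky/slp3/ online.
"""

spacy_output = nlp_model(test_str)
for piece in spacy_output:
    if piece.like_url:  # True when the token looks like a URL
        print(piece)

Running python3 url_check.py on this placeholder text should print the two URLs from the sample string.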

If you can do these things, you should be set to go for upcoming labs. If you're getting stuck, please reach out to me or the grutors to ask for help!

RegEx Practice (17 pts)

Practice regular expressions by playing RegEx Golf.

You must complete the following: (14 points, autograded)

  • Warmup (2 pts)

  • Anchors (2 pts)

  • It never ends (2 pts)

  • Ranges (2 pts)

  • Backrefs (2 pts)

  • One additional puzzle of your choosing (4 pts for the first puzzle, 1 pt extra credit for each additional solution).

Screenshot of the RegEx Golf tool, showing words to match (which all contain "foo") and words not to match (none of which contain "foo").

Turn In

In the file golf.py, fill in the empty strings with your best (that is, shortest) regular expression for each of the puzzles that you solved. You'll get full points as long as your regular expression doesn't avoid the challenge of writing rules that generalize properties of the words (e.g. r"^(word1|word2|word3...)$" would be ridiculously long and a little silly). That said, if your length seems much higher than what's on the leaderboards, challenge yourself to find a shorter one! (You will get 1 pt for a file with any edits, even if the regexes don't work. Since there are tiny differences between how the web application works and how Python regular expressions work, there's a small chance you may pass the test in the game but not pass the autograder test; if this happens, just leave a note in your analysis writeup and we'll check it for you.)
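
Because of those small differences, it can be worth sanity-checking your patterns with Python's re module before you submit. The snippet below is only a rough sketch, not the actual autograder; the pattern and word lists are placeholders you'd replace with your own solution and the words from the puzzle you're checking:

# check_golf.py -- a hypothetical helper (not the real autograder) for
# checking that a pattern behaves the same in Python as in the web game.
import re

pattern = r"ing$"                      # placeholder: paste one of your golf.py answers here
should_match = ["running", "typing"]   # placeholder: the puzzle's "match" words
should_not_match = ["ringer", "single"]  # placeholder: the puzzle's "don't match" words

for word in should_match:
    if not re.search(pattern, word):
        print(f"MISS: {pattern!r} did not match {word!r}")
for word in should_not_match:
    if re.search(pattern, word):
        print(f"FALSE POSITIVE: {pattern!r} matched {word!r}")

If either loop prints anything, that's a sign your pattern may behave differently under Python's re module than it did in the game.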

Additionally, in analysis.md, include a few sentences describing how this activity went for you. Were any puzzles particularly challenging, and what made them challenging? Alternatively, is there a regex you're particularly proud of devising, and if so, why? (2 pts)

When you are finished, make sure to use pandoc as described at the top of the lab to convert your analysis.md file to a PDF. Then, add your analysis.pdf to your GitHub repository before submitting through Gradescope.

Integrity Note

Since these puzzles are online, there are (undoubtedly) solutions posted somewhere. Under the HMC honor code, I trust you to submit the best solutions you can come up with on your own.