Rob Firstman

GitHub Repository

VerseScraper Tool Repository

January 24th-31st

Research methods to programmatically analyze rhyme schemes

Goals

Determine a reliable method to extract phonetic metadata of lyrics
Use this phonetic data to find similarity between words and find patterns
Python Libraries
- Jellyfish: Supports approximate and phonetic matching of strings
  - Includes many string comparison and phonetic encoding algorithms
  - Phonetic encoding appears more promising than string comparison
  - String comparison only measures edit distance, so phonetics would be largely ignored
    - e.g. "Tomb" and "Bomb" would have an edit distance of one, but do not rhyme (perfectly)
  - String Comparison (i.e. edit distance)
    - Levenshtein Distance
    - Damerau-Levenshtein Distance
    - Jaro Distance
    - Jaro-Winkler Distance
    - Match Rating Approach Comparison
    - Hamming Distance
  - Phonetic Encodings
    - American Soundex
    - Metaphone
    - NYSIIS (New York State Identification and Intelligence System)
    - Match Rating Codex
- Poetry-Tools: Performs analysis of poetry and finds rhyme schemes
  - Performs prosodic analysis
  - Prosodic analysis seems to focus less on phonetics than syllables and other units of speech, e.g. intonation, tone, and stress

January 31st-February 7th

Analysis of aforementioned Python libraries

Jellyfish
- Sadly, the phonetic encoding methods provided by this library do not consider the real pronunciation of words
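- The words "bomb" and "tomb" get encodings that differ only in the initial consonant (e.g. Soundex gives B510 and T510), so the difference in vowel sounds is lost regardless of algorithm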
Poetry-Tools
- Like Jellyfish, similarly spelled words that are pronounced differently (refer back to "tomb" and "bomb") are falsely marked as rhyming
- Uses NLTK for the comparisons, unlike Jellyfish which uses phonetic encodings and string comparison algorithms

Instead of Jellyfish or Poetry-Tools, I will look towards NLTK for finding rhymes. A quick try of this code shows that the NLTK library's CMUDict can be used to good effect. NOTE: the `level` parameter to the function `rhyme` adjusts how "good" the rhyme must be to be included in the result set. In my experience, a level of 1 returns a very large set of rhymes, and "tomb" rhymes with "bomb". If the level is raised to 2, the returned set of rhyming words narrows, and "tomb" does NOT rhyme with "bomb".
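To make the effect of `level` concrete, here is a minimal sketch that assumes `level` is the number of trailing phonemes that must match. That is consistent with the behavior described above, but it is an assumption about the referenced code. The tiny `PRONUNCIATIONS` table is a hand-copied stand-in for CMUDict so the example runs without NLTK; with NLTK and its `cmudict` corpus installed, `nltk.corpus.cmudict.dict()` could be passed in instead.

```python
# Hand-copied stand-in for CMUDict: word -> list of pronunciations (ARPAbet phonemes).
PRONUNCIATIONS = {
    "tomb": [["T", "UW1", "M"]],
    "bomb": [["B", "AA1", "M"]],
    "womb": [["W", "UW1", "M"]],
    "room": [["R", "UW1", "M"]],
    "ten": [["T", "EH1", "N"]],
}


def rhyme(word, level, pronunciations=PRONUNCIATIONS):
    """Return the words whose last `level` phonemes match those of `word`."""
    matches = set()
    for target in pronunciations.get(word, []):
        tail = target[-level:]
        for other, prons in pronunciations.items():
            if other != word and any(p[-level:] == tail for p in prons):
                matches.add(other)
    return matches


print(sorted(rhyme("tomb", 1)))  # ['bomb', 'room', 'womb']
print(sorted(rhyme("tomb", 2)))  # ['room', 'womb']
```

At level 1 only the final "M" has to match, so "bomb" slips in; at level 2 the vowel before it must match as well.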

This provides me with a rudimentary system of checking whether two words rhyme with one another. However, imperfect rhymes, which are found often in Hip Hop, are not accounted for. Take the words "bomb" and "qualm", for instance. These words do not rhyme perfectly, and the rhyme checker using NLTK says the same (NOTE: turns out "qualm" does not appear in CMUDict). That being said, the two words share a common vowel sound and could feasibly be used in a rhyme scheme to good effect. Perhaps I could create a wider definition of what constitutes a rhyme in the context of Hip Hop.

I then wrote a simple Python script that takes a text file of lyrics and a word to check rhymes against. The script generates a dictionary of all rhymes for each word in the text, which avoids recomputing a word's rhymes. Even so, the script is painfully slow, taking at least five minutes for a single verse. It also does not handle imperfect rhymes (e.g. "wind" and "ten" do not technically rhyme, but share a similar vowel sound), and multi-word rhymes (e.g. "floats the wind" and "count to ten") are not even close to being recognized.
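One possible way to cut the repeated work, sketched here as an assumption rather than a description of my actual script: index the pronunciation dictionary once by its trailing phonemes, then look each word up in that index instead of rescanning the whole dictionary. `SAMPLE`, `build_tail_index`, and `rhymes_for_text` are illustrative names.

```python
from collections import defaultdict

# Small hand-copied stand-in for CMUDict (word -> list of pronunciations).
SAMPLE = {
    "tomb": [["T", "UW1", "M"]],
    "womb": [["W", "UW1", "M"]],
    "bomb": [["B", "AA1", "M"]],
}


def build_tail_index(pronunciations, level):
    """Map each tuple of the last `level` phonemes to the words that end with it."""
    index = defaultdict(set)
    for word, prons in pronunciations.items():
        for phonemes in prons:
            index[tuple(phonemes[-level:])].add(word)
    return index


def rhymes_for_text(words, pronunciations, level=2):
    """Build {word: set of rhyming words}, computing each distinct word only once."""
    index = build_tail_index(pronunciations, level)  # one pass over the dictionary
    table = {}
    for word in words:
        if word in table:  # already computed for an earlier occurrence
            continue
        found = set()
        for phonemes in pronunciations.get(word, []):
            found |= index[tuple(phonemes[-level:])]
        found.discard(word)
        table[word] = found
    return table


print(rhymes_for_text(["tomb", "womb", "tomb", "bomb"], SAMPLE))
```

The dictionary is scanned once to build the index, so the cost per lyric word no longer grows with the size of the dictionary.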

Tackling the Imperfect Rhyme

The CMUDict uses a set of phonemes to characterize the pronunciation of a given word. Furthermore, it uses the integers 0, 1, and 2 to denote the stress put on a vowel phoneme: 0 corresponds to "no stress", 1 to "primary stress", and 2 to "secondary stress".
Example phonemes returned by CMUDict
- "ten": T, EH1, N
- "wind": W, IH1, N, D
- "millennium": M, AH0, L, EH1, N, IY0, AH0, M
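As a small illustration of the stress digits, the sketch below splits a phoneme such as "EH1" into its base sound and stress level. `split_stress` is a hypothetical helper, not part of NLTK or CMUDict.

```python
def split_stress(phoneme):
    """Split 'EH1' into ('EH', 1); consonants carry no digit and get None."""
    if phoneme[-1].isdigit():
        return phoneme[:-1], int(phoneme[-1])
    return phoneme, None


millennium = ["M", "AH0", "L", "EH1", "N", "IY0", "AH0", "M"]
primary = [base for base, stress in map(split_stress, millennium) if stress == 1]
print(primary)  # ['EH']
```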
Potential solutions
- Create a dictionary mapping a phoneme to similar-sounding phonemes
  - e.g. the "IH" would map to an array containing "EH"
  - Would allow for imperfect rhymes to be accounted for
  - Would need to be done manually and would likely not cover all cases
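A minimal sketch of the mapping idea, assuming that comparing only the last primary-stressed vowel is enough for a first pass. The `SIMILAR_VOWELS` table contains just the one pair from the example above and would need to be filled in by hand; the function names are illustrative.

```python
# Hand-built map of vowels treated as "close enough"; deliberately incomplete.
SIMILAR_VOWELS = {
    "IH": {"EH"},
    "EH": {"IH"},
}


def stressed_vowel(phonemes):
    """Return the last primary-stressed vowel with its stress digit removed, or None."""
    for phoneme in reversed(phonemes):
        if phoneme.endswith("1"):
            return phoneme[:-1]
    return None


def is_near_rhyme(a, b, similar=SIMILAR_VOWELS):
    """True if the stressed vowels are identical or listed as similar."""
    vowel_a, vowel_b = stressed_vowel(a), stressed_vowel(b)
    if vowel_a is None or vowel_b is None:
        return False
    return vowel_a == vowel_b or vowel_b in similar.get(vowel_a, set())


wind = ["W", "IH1", "N", "D"]
ten = ["T", "EH1", "N"]
bomb = ["B", "AA1", "M"]
print(is_near_rhyme(wind, ten))   # True
print(is_near_rhyme(wind, bomb))  # False
```

This ignores the consonants after the vowel entirely, so it over-matches; it is only meant to show how the manual map would be consulted.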

February 7th-14th

Look into Raplyzer, which detects assonance rhymes using eSpeak to get vowel sounds from English words.

Raplyzer

Installing eSpeak onto Mac seemed to be difficult until I found that you can install it using Homebrew
The analyzer seems to do a good job of extracting vowel sounds from English text with eSpeak
After running it over some lyrics from MF DOOM, Andre 3000, J.I.D., and Mos Def, the analyzer clearly favors MF DOOM. This is likely due to MF DOOM's dense use of assonance in his rhyme schemes. J.I.D., an up-and-coming rapper known for his heady lyricism, was placed second out of this group.

Acquiring Rap Lyrics

For my project, I will need a decent amount of rap lyrics organized by artist
LyricsGenius is a Python library that provides a means to scrape lyrics from Genius
- Currently running into issues where the API request times out
Need to utilize the above library in conjunction with a Python script that will retrieve all lyrics for given artists
- Should explore the content of these lyrics to see if they have metadata containing whose verse is whose
- e.g. a rap song may have a featured artist/multiple people rapping, so distinguishing between who's at the mic would be beneficial
- For finding rhyme schemes for a single song, the differentiation is not as necessary, but it could be useful for rhyme analysis between artists
  - Rhyme schemes should ideally start over at the beginning of any verse

February 14th-21st

With three tests this week and next, as well as a retreat this weekend for my fraternity, I have not had a ton of time to work on this.

Continued working on my lyric scraping script. Currently, it accepts a list of artists and generates a JSON file that contains the verses for each song found for a given artist. The file has the format { artist: { album: { song: [verses] } } }.

From the Genius API, each song has a format such that each verse is preceded with something like [ Verse 1: Andre 3000 ], so I wrote a regular expression to capture the type of verse, the artist delivering the verse, and the content of the verse itself.
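A rough sketch of what such a regular expression could look like. This is not the exact pattern from my script; the group names and the `default_artist` fallback for headers without a name are illustrative assumptions.

```python
import re

VERSE_PATTERN = re.compile(
    r"^\[\s*(?P<kind>[^\]:]+?)\s*"         # verse type, e.g. "Verse 1" or "Chorus"
    r"(?::\s*(?P<artist>[^\]]+?))?\s*\]"   # optional ": Artist Name"
    r"\n(?P<body>.*?)(?=^\[|\Z)",          # text up to the next header or end of input
    re.DOTALL | re.MULTILINE,
)


def parse_verses(lyrics, default_artist=None):
    """Return a list of {kind, artist, text} dicts, one per bracketed section."""
    verses = []
    for match in VERSE_PATTERN.finditer(lyrics):
        verses.append({
            "kind": match.group("kind"),
            "artist": match.group("artist") or default_artist,
            "text": match.group("body").strip(),
        })
    return verses


sample = "[Verse 1: Andre 3000]\nline one\nline two\n\n[Chorus]\nhook line\n"
for verse in parse_verses(sample, default_artist="OutKast"):
    print(verse)
```

Headers with extra spaces, such as "[ Verse 1: Andre 3000 ]", are also matched because whitespace around the type and the name is skipped.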

Now, I need to write the code such that a user can supply a list of artists and get a file (or files) containing the verses for each artist. The output could take one of two forms:

A single JSON file containing all verses for all artists queried
- JSON file could be easily worked with in the future, but could become cumbersome and large
A distributed set of folders that contain raw text files containing verses
- Folders would be of format <Artist>/<Album>/<Song>.txt
- Would allow for easy integration with Raplyzer as it looks for this particular folder structure
- Could make reuse more difficult

As it stands, I have the script do this:

Iterate over an array of artist names
1. Fetch the most popular songs for each artist (number fetched can be altered)
2. Generate a dictionary for each song
3. Merge each dictionary with the master lyric dictionary
Dump the master lyric dictionary to a JSON file
Read through the JSON file and generate a set of folders and text files as detailed above
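The merge-and-dump steps above can be sketched as follows. This leaves out the Genius fetching entirely and assumes artist, album, and song names are safe to use as folder and file names; `merge_song`, `write_outputs`, and the `lyrics.json` file name are illustrative, not the names used in my script.

```python
import json
from pathlib import Path


def merge_song(master, artist, album, song, verses):
    """Add one song's verses to the master { artist: { album: { song: [verses] } } } dict."""
    master.setdefault(artist, {}).setdefault(album, {})[song] = verses


def write_outputs(master, out_dir):
    """Dump the master dict to JSON, then read it back and write <Artist>/<Album>/<Song>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / "lyrics.json"
    json_path.write_text(json.dumps(master, indent=2), encoding="utf-8")

    data = json.loads(json_path.read_text(encoding="utf-8"))
    for artist, albums in data.items():
        for album, songs in albums.items():
            folder = out_dir / artist / album
            folder.mkdir(parents=True, exist_ok=True)
            for song, verses in songs.items():
                # Assumption: verses are separated by a blank line in the text file.
                (folder / f"{song}.txt").write_text("\n\n".join(verses), encoding="utf-8")
    return json_path
```

Calling `merge_song` once per fetched song and `write_outputs` at the end gives both output forms from the same dictionary.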

TO DO

Implement some sort of alias system
- Provide a list of aliases for a given artist
- Would allow for artists who operate under multiple names or in multiple groups to be accounted for
  - e.g. MF DOOM has put out work under names like "Viktor Vaughn", "King Geedorah", "Madvillain", etc.
Handle edge cases where songs with features have the featured artist as part of the song's artist name AND the artist name isn't listed in verse metadata
- e.g. the song "LOYALTY" by Kendrick Lamar and Rihanna: they are both listed as the artists for the song

February 21st-February 28th

Two tests this week, so my time available to work on this was limited.

Continue work on handling edge cases for my lyric scraping tool

Spent most of my time fiddling with the regular expression that finds verses in lyrics returned by LyricsGenius
Added a rule to my regular expression to ignore an artist's name following an "&"
- e.g. in Kanye West's "Father Stretch My Hands pt. 1", his verse is listed with "[Verse: Kanye West & Kelly Price]"
- By ignoring things after the ampersand, we can help ensure that artists' verses do not get split up based on collaborators
- Needed some tweaks such that it could match descriptors with both &s and without
Ended up ditching the previous rule and handling it within my Python script instead
- The regular expression will match the full artist string (including ones with &s)
- The Python script will look at the match and check if the artist's name provided by the function parameter is contained within the match
- If it is, then the artist parameter value will be used instead of the one found by the regex
- Issue: sometimes the artist's name is contained in the match, but occurs as the second name after the "&"
  - e.g. "[Big Sean & Kanye West]" would be Big Sean's verse, but my script takes it as Kanye West's
- Fix: First check if the artist's name is in the match, then see if it occurs as the first element in the string
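A minimal sketch of the fix described above. `resolve_artist` is a hypothetical name, and returning the full matched string when the queried artist is not listed first is my assumption about the fallback.

```python
def resolve_artist(matched, queried):
    """Pick the artist label for a verse header such as 'Kanye West & Kelly Price'."""
    matched = matched.strip()
    # Step 1: is the queried artist's name in the match at all?
    if queried.lower() not in matched.lower():
        return matched
    # Step 2: is it the first name listed (the part before any "&")?
    first = matched.split("&")[0].strip()
    if first.lower() == queried.lower():
        return queried
    # Listed, but not first: keep the regex match so the verse is not misattributed.
    return matched


print(resolve_artist("Kanye West & Kelly Price", "Kanye West"))  # Kanye West
print(resolve_artist("Big Sean & Kanye West", "Kanye West"))     # Big Sean & Kanye West
```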

Potential Improvements

For a smaller set of artists, the tool is only bottlenecked by the time taken to fetch lyrics from the Genius API
For a large number of artists, I am concerned that my current method of storing the lyric dictionary would exhaust system resources
- Now, the lyric dictionary is held in memory and appended to for each song found
- The lyric dictionary is written to the disk once all songs are fetched and processed
- A probable solution would be to save the lyric dictionary after every artist
  - Not after every song, since this would require many system calls
- The JSON file would need to be fetched and merged after each artist
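The per-artist save could look roughly like the sketch below, assuming the same { artist: { album: { song: [verses] } } } layout. `save_artist` is an illustrative name. Note that it still loads the whole JSON file on each call, so it lowers peak memory during fetching rather than removing the file-size concern.

```python
import json
from pathlib import Path


def save_artist(json_path, artist, albums):
    """Merge one artist's { album: { song: [verses] } } dict into the JSON file on disk."""
    path = Path(json_path)
    # Fetch the existing file (if any), merge, and write it back.
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    data.setdefault(artist, {}).update(albums)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

After each call the in-memory dictionary for that artist can be discarded.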

February 28th-March 14th

Goal: Examine Raplyzer to see how it detects rhymes

Determine whether I should use Raplyzer in my final rhyme-scheme analyzer or only utilize the logic found within it
- i.e. should I use functions already included in Raplyzer for rhyme scheme analysis? or should I rewrite some of the functions for my own use?
At first glance, the code is very dense and hard to unpack, so it may be ideal for me to utilize Raplyzer as a black box
- Put lyrics in, get rhymes out
Raplyzer does not keep track of every rhyme in the lyrics supplied
- Instead, it searches for the longest rhyme in the text and keeps track of the average rhyme length along the way
- Rhyme data is stored in the form of (rhyme length, starting word position, ending word position)
- May want to edit the source code so that it saves every rhyme it comes across
Made a slight edit so the Lyrics object tracks every rhyme tuple it finds
- Some of these end up being duplicates despite having slightly different values--need to investigate
Raplyzer has a key difference in how it finds rhymes vs. how I want to find them
- Raplyzer finds rhymes within a certain number of words of a particular vowel sound
- For my purposes, I would want to find all occurrences of the particular rhyme to generate a scheme like below

Potential algorithm for finding a rhyme scheme

Use Raplyzer to find a particular rhyme
Take that rhyme and look for occurrences of similar rhymes thereafter
"Highlight" individual syllables of each rhyme (based on rhyming vowels)
- This is more advanced, I may just settle for finding the common rhymes

Problems with Raplyzer

I fixed a bug Raplyzer ran into when handling artist names that are more than two words
Raplyzer would naively construct eSpeak commands without wrapping the arguments in parentheses
Explicitly adding parentheses to the command parameters fixed the issue

Lyric Scraping Improvements

Currently, artists are sometimes split up by songs that contain features and those that do not
Example: Kanye West appears in the resulting lyric file, but so does "Kanye West and Kelly Price"
- Note: the verse under "Kanye West and Kelly Price" is one performed by Kanye West
This may be a side effect of how I handle ambiguous verse attributions
- On Genius, verses may or may not have an artist's name attached (e.g. "[Verse 1: Kanye West]" or "[Verse 1]")
- If the artist's name is not included, my script will use the artist attributed to the song by the Genius API
- If a song has featured artists, they too will be in the artist field (e.g. "Kanye West & Chance the Rapper")
Or it is a side effect of how I fetch artist names in verses
- A verse may have multiple artists attributed to it
  - e.g. "[Verse 3: Kanye West & 2 Chainz]"
- Usually, the first artist listed is delivering the majority of the verse
  - However, this is not always the case
  - There likely exist verses where two rappers are going back and forth
- Still, the first-listed artist likely delivers most of the verse in almost all cases
Potential solutions
- Simply ignoring anything after the first "&" would be an easy solution, albeit rigid

Prep for the 3/4 Presentation

Currently, I have made less progress on automatically analyzing rhyme schemes than anticipated
However, my lyric scraping tool is well fleshed out and may be the focus of my presentation
- Demonstrate how it is used and why it is useful
- I have not yet found a lyric scraper that breaks songs into their individual verses
- Could focus on that, given that our group's focus is analyzing hip hop and jazz lyrics

March 14th-March 28th

Prepare for 3/4 Presentation

Will primarily be showing off my lyric scraping tool
- Focus on how this tool is different from other lyric fetching tools
  - No other tool fetches individual verses for a particular artist
  - Furthermore, these verses can be fetched from songs containing multiple verses from different artists
Will also touch on my next step--rhyme scheme analysis
- Wrote a short script using Raplyzer to find rhyming words for a given vowel sound
- Can run this over all vowels found using the eSpeak vowel representation
- Some interesting "vowels" come up from this representation
  - e.g. "V", example words: 'welcome', 'what', 'what', 'one', 'just'
  - e.g. "3" and "L", the word "original" has both of these sounds
  - e.g. "U" (not "u"), example words: 'floats', 'count', 'so', 'go', 'out', "yo'", 'Look', 'now', 'could'

April

Prepare for Final Presentation

Put some finishing touches on my verse scraper and on my rhyme scheme generation tool (see below)
Put verse scraping tool on the Python Package Index (see below)
My presentation will demonstrate the capabilities of my verse scraping tool as well as show sample output from my rhyme scheme tool.

Rhyme Scheme Generation -- "RhymeSchemer"

Ditched the idea of displaying a visual representation of rhyme schemes on the command line
- There are too few colors available at the command line for this to work properly
Wrote function to write rhyme schemes to an HTML file
- Randomly generate hex colors for each vowel sound
- Color each word depending on what vowel it rhymes with
  - TODO: gracefully handle words that have multiple rhymes
    - Potential solution: only add vowel-related colors for the syllables in the word that contain said vowels
- Example output below
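A minimal sketch of the HTML coloring step described above, not the actual RhymeSchemer code. It assumes each word has already been tagged with at most one vowel sound (so the multiple-rhyme TODO is not handled), and the names `random_hex_color` and `render_scheme` are illustrative.

```python
import html
import random


def random_hex_color(rng):
    """Return a random color such as '#3fa2c7'."""
    return "#{:06x}".format(rng.randrange(0x1000000))


def render_scheme(tagged_words, seed=None):
    """Render [(word, vowel or None), ...] as one HTML paragraph with colored rhymes.

    Returns the HTML string and the {vowel: color} mapping that was generated.
    """
    rng = random.Random(seed)
    colors = {}
    parts = []
    for word, vowel in tagged_words:
        text = html.escape(word)
        if vowel is None:  # word is not part of any rhyme: leave it uncolored
            parts.append(text)
            continue
        if vowel not in colors:  # first time this vowel appears: pick its color
            colors[vowel] = random_hex_color(rng)
        parts.append(f'<span style="color: {colors[vowel]}">{text}</span>')
    return "<p>" + " ".join(parts) + "</p>", colors


page, colors = render_scheme([("floats", "U"), ("the", None), ("count", "U")], seed=1)
print(page)
```

Because the colors are random, two vowels can occasionally end up with similar colors; a fixed palette would avoid that at the cost of a limit on the number of vowels.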

Verse Scraping Tool

Uploaded my tool to the Python Package Index (PyPI)
- i.e. my tool can be installed with pip
- "pip install versescraper"
PyPI Link
GitHub Repository
