GitHub Repository
VerseScraper Tool Repository
January 24th-31st
Research methods to programmatically analyze rhyme schemes
Goals
- Determine a reliable method to extract phonetic metadata of lyrics
- Use this phonetic data to find similarity between words and find patterns
- Python Libraries
- Jellyfish: Supports approximate and phonetic matching of strings
- Includes many string comparison and phonetic encoding algorithms
- Phonetic encoding appears more promising than string comparison
- String comparison only measures edit distance, so phonetics would be largely ignored
- e.g. "Tomb" and "Bomb" would have an edit distance of one, but do not rhyme (perfectly)
- String Comparison (i.e. edit distance)
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance
- Phonetic Encodings
- American Soundex
- Metaphone
- NYSIIS (New York State Identification and Intelligence System)
- Match Rating Codex
- Poetry-Tools: Performs analysis of poetry and finds rhyme schemes
- Performs prosodic analysis
- Prosodic analysis seems to focus less on phonetics than syllables and other units of speech, e.g. intonation, tone, and stress
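The "Tomb"/"Bomb" observation above can be checked with a few lines of plain Python. This is a minimal sketch of Levenshtein distance written from scratch so it runs without Jellyfish installed; the function name `levenshtein` is my own and is not the library's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("tomb", "bomb"))  # 1, yet the words do not rhyme
print(levenshtein("tomb", "room"))  # 3, yet the words do rhyme
```

Spelling distance and rhyme can point in opposite directions, which is why edit distance alone looks like a poor fit here.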
January 31st-February 7th
Analysis of aforementioned Python libraries
- Jellyfish
- Sadly, the phonetic encoding methods provided by this library do not consider the real pronunciation of words
- The words "bomb" and "tomb" receive the same phonetic encoding apart from the initial letter, regardless of algorithm, so they look like rhymes even though they are not
- Poetry-Tools
- Like Jellyfish, it falsely marks similarly spelled words that are pronounced differently (refer back to "tomb" and "bomb") as rhyming
- Uses NLTK for the comparisons, unlike Jellyfish, which uses phonetic encodings and string comparison algorithms
Instead of Jellyfish or Poetry-Tools, I will look towards NLTK for finding rhymes. A quick try of this code shows that the NLTK library's CMUDict can be used to good effect. NOTE: the `level` parameter of the `rhyme` function adjusts how "good" a rhyme must be to be included in the result set. In my experience, a level of 1 returns a very large set of rhymes; here, "tomb" rhymes with "bomb". If the level is raised to 2, the returned set of rhymes narrows, and "tomb" no longer rhymes with "bomb".
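A minimal sketch of how a `level` parameter like this can behave, assuming it means "the last `level` phonemes must match". That reading is my assumption. To keep the sketch runnable without NLTK, it uses a tiny hand-typed dictionary in CMUDict's phoneme style instead of the real corpus, and unlike the original snippet it leaves the query word out of the result.

```python
# Tiny hand-typed stand-in for CMUDict (assumption: a real run would
# load the pronunciations from NLTK's cmudict corpus instead).
PRONUNCIATIONS = {
    "tomb": ["T", "UW1", "M"],
    "womb": ["W", "UW1", "M"],
    "bomb": ["B", "AA1", "M"],
    "calm": ["K", "AA1", "M"],
}


def rhyme(word, level, pronunciations=PRONUNCIATIONS):
    """Return every other word whose last `level` phonemes match those of `word`."""
    target = pronunciations[word][-level:]
    return {
        other
        for other, phonemes in pronunciations.items()
        if other != word and phonemes[-level:] == target
    }


print(rhyme("tomb", 1))  # womb, bomb and calm: only the final M has to match
print(rhyme("tomb", 2))  # only womb: the vowel has to match as well
```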
This provides me with a rudimentary system of checking whether two words rhyme with one another. However, imperfect rhymes, which are found often in Hip Hop, are not accounted for. Take the words "bomb" and "qualm", for instance. These words do not rhyme perfectly, and the rhyme checker using NLTK says the same (NOTE: turns out "qualm" does not appear in CMUDict). That being said, the two words share a common vowel sound and could feasibly be used in a rhyme scheme to good effect. Perhaps I could create a wider definition of what constitutes a rhyme in the context of Hip Hop.
I then went on to create a simple script in Python that takes in a text file of lyrics and a word to check rhymes against. The script generates a dictionary of all rhymes for each word in the text, which avoids recalculating a word's rhymes. However, even with this, the script is painfully slow, taking at least five minutes for a single verse. In addition, the script does not work for imperfect rhymes (i.e. words like "wind" and "ten" do not technically rhyme, but share a similar vowel sound). Furthermore, multi-word rhymes (e.g. "floats the wind" and "count to ten") are not even close to being recognized.
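One possible cause of the slowness (an assumption, since I have not profiled the script) is rescanning the whole pronunciation dictionary for every word. A sketch of an alternative: group all words by their trailing phonemes once, then look each lyric word up in that index. The names `build_rhyme_index` and `rhymes_in_text` are hypothetical, and the small dictionary stands in for CMUDict.

```python
from collections import defaultdict

# Hand-typed stand-in for CMUDict so the sketch is self-contained.
PRONUNCIATIONS = {
    "tomb": ["T", "UW1", "M"],
    "womb": ["W", "UW1", "M"],
    "bomb": ["B", "AA1", "M"],
    "ten": ["T", "EH1", "N"],
}


def build_rhyme_index(pronunciations, level=2):
    """Group words by their last `level` phonemes in a single pass."""
    index = defaultdict(set)
    for word, phonemes in pronunciations.items():
        index[tuple(phonemes[-level:])].add(word)
    return index


def rhymes_in_text(words, pronunciations, level=2):
    """Map each known word in `words` to the set of words that rhyme with it."""
    index = build_rhyme_index(pronunciations, level)
    result = {}
    for word in words:
        phonemes = pronunciations.get(word)
        if phonemes is None:
            continue  # word is missing from the dictionary
        result[word] = index[tuple(phonemes[-level:])] - {word}
    return result


print(rhymes_in_text("tomb womb bomb unknown".split(), PRONUNCIATIONS))
```

Each lookup is then a dictionary access instead of a full scan, so the cost per lyric word no longer grows with the size of the dictionary.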
Tackling the Imperfect Rhyme
- The CMUDict uses a set of phonemes to characterize the pronunciation of a given word. Furthermore, it appends the integers 0, 1, and 2 to vowel phonemes to denote stress. 0 corresponds to "no stress", 1 to "primary stress", and 2 to "secondary stress"
- Example phonemes returned by CMUDict
- "ten": T, EH1, N
- "wind": W, IH1, N, D
- "millennium": M, AH0, L, EH1, N, IY0, AH0, M
- Potential solutions
- Create a dictionary mapping a phoneme to similar-sounding phonemes
- e.g. the "IH" would map to an array containing "EH"
- Would allow for imperfect rhymes to be accounted for
- Would need to be done manually and would likely not cover all cases
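The phoneme-mapping idea above can be sketched as follows. The `SIMILAR_VOWELS` table is a hand-made, hypothetical example containing only the IH/EH pair mentioned in the list; a real table would need many more entries. The check compares only the vowel with primary stress and ignores the consonants that follow it, which is a deliberately loose definition of rhyme.

```python
# Hypothetical, manually curated table of vowels treated as "close enough".
SIMILAR_VOWELS = {
    "IH": {"EH"},
    "EH": {"IH"},
}


def strip_stress(phoneme):
    """Drop the trailing stress digit, e.g. 'EH1' -> 'EH'."""
    return phoneme.rstrip("012")


def stressed_vowel(phonemes):
    """Return the last vowel carrying primary stress, without its stress digit."""
    for phoneme in reversed(phonemes):
        if phoneme.endswith("1"):
            return strip_stress(phoneme)
    return None


def is_imperfect_rhyme(a, b, similar=SIMILAR_VOWELS):
    """True when the stressed vowels are identical or listed as similar."""
    vowel_a, vowel_b = stressed_vowel(a), stressed_vowel(b)
    if vowel_a is None or vowel_b is None:
        return False
    return vowel_a == vowel_b or vowel_b in similar.get(vowel_a, set())


TEN = ["T", "EH1", "N"]
WIND = ["W", "IH1", "N", "D"]
BOMB = ["B", "AA1", "M"]

print(is_imperfect_rhyme(TEN, WIND))  # True: EH and IH are listed as similar
print(is_imperfect_rhyme(TEN, BOMB))  # False: EH and AA are not
```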
February 7th-14th
Look into Raplyzer, which detects assonance rhymes using eSpeak to get vowel sounds from English words.
Raplyzer
- Installing eSpeak on macOS seemed difficult until I found that it can be installed using Homebrew
- The analyzer seems to do a good job of extracting vowel sounds from English text with eSpeak
- After running it over some lyrics from MF DOOM, Andre 3000, J.I.D., and Mos Def, MF DOOM is a clear favorite by the analyzer. This is likely due to MF DOOM's dense use of assonance in his rhyme schemes. J.I.D., an up-and-coming rapper known for his heady lyricism, was placed second out of this group.
Acquiring Rap Lyrics
- For my project, I will need a decent amount of rap lyrics organized by artist
- LyricsGenius is a Python library that provides a means to scrape lyrics from Genius
- Currently running into issues where the API request times out
- Need to utilize the above library in conjunction with a Python script that will retrieve all lyrics for given artists
- Should explore the content of these lyrics to see if they have metadata containing whose verse is whose
- e.g. a rap song may have a featured artist/multiple people rapping, so distinguishing between who's at the mic would be beneficial
- For finding rhyme schemes for a single song, the differentiation is not as necessary, but it could be useful for rhyme analysis between artists
- Rhyme schemes should ideally start over at the beginning of any verse
February 14th-21st
With three tests this week and next, as well as a retreat this weekend for my fraternity, I have not had a ton of time to work on this.
Continued working on my lyric scraping script. Currently, it can accept a list of artists and will generate a JSON file that contains verses for each song found for a given artist. The file has the format { artist: { album: { song: [verses] } } }.
In the lyrics returned by the Genius API, each verse is preceded by a header like [Verse 1: Andre 3000], so I wrote a regular expression to capture the type of verse, the artist delivering the verse, and the content of the verse itself.
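A minimal sketch of this kind of header parsing. It is not the exact expression from my script; the pattern, the `HEADER` name, and the `split_sections` helper are illustrative, and the sample lyrics are made up. It assumes each header sits on its own line and that the artist part is optional (e.g. "[Chorus]").

```python
import re

# Matches headers such as "[Verse 1: Andre 3000]" or "[Chorus]" on their own line.
HEADER = re.compile(
    r"^\[[ \t]*(?P<section>[^\]:\n]+?)[ \t]*"
    r"(?::[ \t]*(?P<artist>[^\]\n]+?))?[ \t]*\][ \t]*$",
    re.MULTILINE,
)


def split_sections(lyrics):
    """Split raw lyrics into dicts holding section type, artist and text."""
    matches = list(HEADER.finditer(lyrics))
    sections = []
    for i, match in enumerate(matches):
        # A section's text runs until the next header or the end of the lyrics.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(lyrics)
        sections.append({
            "section": match.group("section"),
            "artist": match.group("artist"),  # None when no artist is named
            "text": lyrics[match.end():end].strip(),
        })
    return sections


SAMPLE = """[Verse 1: Andre 3000]
Line one
Line two

[Chorus]
Hook line
"""

for section in split_sections(SAMPLE):
    print(section)
```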
Now, I need to write the code such that a user could supply a list of artists and get a file or set of files containing verses for each artist. The output could take one of two forms:
- A single JSON file containing all verses for all artists queried
- JSON file could be easily worked with in the future, but could become cumbersome and large
- A distributed set of folders that contain raw text files containing verses
- Folders would be of format <Artist>/<Album>/<Song>.txt
- Would allow for easy integration with Raplyzer as it looks for this particular folder structure
- Could make reuse more difficult
As it stands, the script does the following:
- Iterate over an array of artist names
- Fetch the most popular songs for each artist (number fetched can be altered)
- Generate a dictionary for each song
- Merge each dictionary with the master lyric dictionary
- Dump the master lyric dictionary to a JSON file
- Read through the JSON file and generate a set of folders and text files as detailed above
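The last step above, turning the nested dictionary into `<Artist>/<Album>/<Song>.txt` files, can be sketched as below. The helper names `safe_name` and `write_lyric_tree` and the example data are hypothetical, and joining verses with a blank line is my assumption about the text format.

```python
import tempfile
from pathlib import Path


def safe_name(name):
    """Replace path separators so a title cannot escape its folder."""
    return name.replace("/", "_").replace("\\", "_").strip() or "untitled"


def write_lyric_tree(lyrics, root):
    """Write { artist: { album: { song: [verses] } } } as Artist/Album/Song.txt."""
    root = Path(root)
    written = []
    for artist, albums in lyrics.items():
        for album, songs in albums.items():
            folder = root / safe_name(artist) / safe_name(album)
            folder.mkdir(parents=True, exist_ok=True)
            for song, verses in songs.items():
                path = folder / f"{safe_name(song)}.txt"
                # Separate verses with a blank line (assumed convention).
                path.write_text("\n\n".join(verses), encoding="utf-8")
                written.append(path)
    return written


# Placeholder data in the same shape as the master lyric dictionary.
EXAMPLE = {
    "Example Artist": {
        "Example Album": {"Example Song": ["first verse", "second verse"]},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    for path in write_lyric_tree(EXAMPLE, tmp):
        print(path.relative_to(tmp))
```

In the real script the dictionary would come from `json.load` on the dumped file instead of a literal.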
TO DO
- Implement some sort of alias system
- Provide a list of aliases for a given artist
- Would allow for artists who operate under multiple names or in multiple groups to be accounted for
- e.g. MF DOOM has put out work under names like "Viktor Vaughn", "King Geedorah", "Madvillain", etc.
- Handle edge cases where songs with features have the featured artist as part of the song's artist name AND the artist name isn't listed in verse metadata
- e.g. the song "LOYALTY" by Kendrick Lamar and Rihanna, where both are listed as the artists for the song
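The alias item in the TO DO list could start from something like the sketch below: a table from canonical name to aliases, flattened into a case-insensitive lookup. The `ALIASES` table and both function names are hypothetical; nothing like this exists in the script yet.

```python
# Hypothetical alias table: canonical artist name -> other names used.
ALIASES = {
    "MF DOOM": ["Viktor Vaughn", "King Geedorah"],
}


def build_alias_lookup(aliases):
    """Flatten the alias table into a case-insensitive name -> canonical map."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.casefold()] = canonical
        for name in names:
            lookup[name.casefold()] = canonical
    return lookup


def canonical_artist(name, lookup):
    """Return the canonical name for `name`, or `name` itself if unknown."""
    cleaned = name.strip()
    return lookup.get(cleaned.casefold(), cleaned)


LOOKUP = build_alias_lookup(ALIASES)
print(canonical_artist("viktor vaughn", LOOKUP))  # MF DOOM
print(canonical_artist("Mos Def", LOOKUP))        # Mos Def (no alias known)
```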
February 21st-February 28th
Two tests this week, so my time available to work on this was limited.
Continued work on handling edge cases for my lyric scraping tool
- Spent most of my time fiddling with the regular expression that finds verses in lyrics returned by LyricsGenius
- Added a rule to my regular expression to ignore an artist's name following an "&"
- e.g. in Kanye West's "Father Stretch My Hands pt. 1", his verse is listed with "[Verse: Kanye West & Kelly Price]"
- By ignoring things after the ampersand, we can help ensure that artists' verses do not get split up based on collaborators
- Needed some tweaks so that it could match descriptors both with and without &s
- Ended up ditching the previous rule and handling it within my Python script instead
- The regular expression will match the full artist string (including ones with &s)
- The Python script will look at the match and check if the artist's name provided by the function parameter is contained within the match
- If it is, then the artist parameter value will be used instead of the one found by the regex
- Issue: sometimes the artist's name is contained in the match, but occurs as the second name after the "&"
- e.g. "[Big Sean & Kanye West]" would be Big Sean's verse, but my script takes it as Kanye West's
- Fix: First check if the artist's name is in the match, then see if it occurs as the first element in the string
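The fix described above can be sketched as a small helper. The name `attribute_verse` and the choice to credit the first listed name when the queried artist is not first are assumptions about how the script could behave, not a copy of its code.

```python
def attribute_verse(header_artists, queried_artist):
    """Decide who a verse belongs to, given the artist string from the header.

    The verse is credited to the queried artist only when that artist is the
    first name listed; otherwise it goes to whoever is listed first.
    """
    names = [name.strip() for name in header_artists.split("&") if name.strip()]
    if names and names[0].casefold() == queried_artist.casefold():
        return queried_artist
    return names[0] if names else None


print(attribute_verse("Kanye West & Kelly Price", "Kanye West"))  # Kanye West
print(attribute_verse("Big Sean & Kanye West", "Kanye West"))     # Big Sean
```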
Potential Improvements
- For a smaller set of artists, the tool is only bottlenecked by the time taken to fetch lyrics from the Genius API
- For a large number of artists, I am concerned that my current method of storing the lyric dictionary would exhaust system resources
- Currently, the lyric dictionary is held in memory and appended to for each song found
- The lyric dictionary is written to the disk once all songs are fetched and processed
- A probable solution would be to save the lyric dictionary after every artist
- Not after every song, since this would require many system calls
- The JSON file would need to be fetched and merged after each artist
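A sketch of the save-after-every-artist idea, assuming a single JSON file in the { artist: { album: { song: [verses] } } } shape. The function name `merge_artist` is hypothetical. Writing to a temporary file and renaming it is an extra precaution I added so an interrupted run does not leave a half-written file.

```python
import json
from pathlib import Path


def merge_artist(path, artist, albums):
    """Merge one artist's albums into the JSON file at `path` and save it."""
    path = Path(path)
    # Load what has been saved so far, or start fresh on the first artist.
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    existing = data.setdefault(artist, {})
    for album, songs in albums.items():
        existing.setdefault(album, {}).update(songs)
    # Write to a sibling temp file first, then swap it into place.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")
    tmp.replace(path)
    return data
```

This still loads the whole file for each artist, so it mainly protects against losing work and lets the per-artist dictionary be discarded between artists. If memory remains a concern, writing one JSON file per artist would avoid the merge entirely.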
February 28th-March 14th
Goal: Examine Raplyzer to see how it detects rhymes
- Determine whether I should use Raplyzer in my final rhyme-scheme analyzer or only utilize the logic found within it
- i.e. should I use functions already included in Raplyzer for rhyme scheme analysis, or should I rewrite some of the functions for my own use?
- At first glance, the code is very dense and hard to unpack, so it may be ideal for me to use Raplyzer as a black box
- Put lyrics in, get rhymes out
- Raplyzer does not keep track of every rhyme in the lyrics supplied
- Instead, it searches for the longest rhyme in the text and keeps track of the average rhyme length along the way
- Rhyme data is stored in the form of (rhyme length, starting word position, ending word position)
- May want to edit the source code so that it saves every rhyme it comes across
- Made a slight edit so the Lyrics object tracks every rhyme tuple it finds
- Some of these appear to be duplicates despite having slightly different values; need to investigate
- Raplyzer has a key difference in how it finds rhymes vs. how I want to find them
- Raplyzer finds rhymes within a certain number of words of a particular vowel sound
- For my purposes, I would want to find all occurrences of the particular rhyme to generate a scheme like below