The data was collected from Pitchfork and aggregated into two separate files: one for pop, one for experimental.
These files are passed in on the terminal command line and, in doing so, run through my program. The first function of the program cleans the text by lowercasing all words; removing all punctuation, special characters, and numbers; and returning a cleaned list of the text.
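Here is a minimal sketch of what that cleaning step might look like (the function name clean_text and its exact signature are placeholders, not necessarily what the project uses):

def clean_text(filename):
    # Lowercase a file's text, strip punctuation, special characters,
    # and numbers, and return the cleaned words as a list
    with open(filename, encoding="utf-8") as f:
        text = f.read().lower()
    # Keep only letters and whitespace; drop everything else
    letters_only = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    return letters_only.split()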
The next function of my program takes the cleaned text and removes every instance of a stop word from each file, creating two new cleaned text files. Next, I use Python's Counter to find the 20 most commonly occurring words in each file. I then find which words the two lists of 20 most common words have in common, as well as which words are unique to each file.
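A sketch of those steps, assuming cleaned word lists like the ones produced above (the stop-word set here is abbreviated, and the sample word lists are stand-ins for the real review files):

from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}  # abbreviated

def remove_stop_words(words):
    # Drop every stop word from a cleaned word list
    return [w for w in words if w not in STOP_WORDS]

def top_20(words):
    # The 20 most commonly occurring words, via collections.Counter
    return [word for word, count in Counter(words).most_common(20)]

# Stand-in word lists for the two review files
pop_words = ["the", "song", "is", "a", "catchy", "pop", "song"]
experimental_words = ["the", "noise", "in", "the", "song", "is", "abrasive"]

pop_top = top_20(remove_stop_words(pop_words))
exp_top = top_20(remove_stop_words(experimental_words))

shared = set(pop_top) & set(exp_top)    # words the two top-20 lists have in common
pop_only = set(pop_top) - set(exp_top)  # words unique to the pop list
exp_only = set(exp_top) - set(pop_top)  # words unique to the experimental list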
Next, several functions in the code work together with the ultimate goal of outputting the Automated Readability Index (ARI): a function that counts the words in each file, a function that counts the number of characters per file, a function that counts the sentences per file, and finally a function that puts it all together via the ARI equation.
The ARI is an index that relies on the ratios of characters per word and words per sentence to determine how "readable" a text file is: the harder the file is to read, the higher the index score. The ARI represents the grade level at which the text can be easily read.
ARI Equation: automated_index = 4.71 * count_characters/count_words + 0.5 * count_words/count_sentences - 21.43
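A sketch of how those counting functions could feed the equation (the function names follow the description above, though the exact signatures in the project may differ; note the sentence count has to come from the raw text, since cleaning strips out the punctuation):

def count_words(words):
    # Total words in the cleaned word list
    return len(words)

def count_characters(words):
    # Total letters across all cleaned words
    return sum(len(word) for word in words)

def count_sentences(raw_text):
    # Rough sentence count: one sentence per terminal punctuation mark
    return sum(raw_text.count(mark) for mark in ".!?")

def automated_index(words, raw_text):
    # The ARI equation above, applied to the three counts
    return (4.71 * count_characters(words) / count_words(words)
            + 0.5 * count_words(words) / count_sentences(raw_text)
            - 21.43)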
The final step is calling everything in main. This project's main function takes two sys.argv inputs, which is where each of the filenames is entered in your terminal. If a file is not correct, the program prints an error notice.
The main also runs the word cleaner twice, once per file, to establish the clean text for the rest of the program to use.
The main also calls the most_common_words function directly, because that function is not invoked inside the automated_index aggregator function the way many of the other functions are.
Finally, the main runs the automated_index function twice, once per file, to get each file's ARI.
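Putting that together, main might look roughly like this, assuming the sketched helpers above (the error handling shown is one plausible way to print the notice; the project's own version may differ):

import sys

def main():
    # Expect exactly two filenames after the script name
    if len(sys.argv) != 3:
        print("Error: please supply two text files.")
        sys.exit(1)

    for filename in (sys.argv[1], sys.argv[2]):
        try:
            words = clean_text(filename)  # word cleaner runs once per file
            with open(filename, encoding="utf-8") as f:
                raw_text = f.read()
        except FileNotFoundError:
            print("Error: could not open", filename)
            sys.exit(1)
        no_stops = remove_stop_words(words)
        print(filename, "most common words:", top_20(no_stops))  # called directly in main
        print(filename, "ARI:", automated_index(words, raw_text))

if __name__ == "__main__":
    main()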
The Python file that runs the code described above is attached below! To run it, type in your command line:
python3 JuliaElia_Project2.py FILENAME1 FILENAME2
If you've downloaded the relevant data files for this project, found on the Data page of this site, then the command line will look like:
python3 JuliaElia_Project2.py ExperimentalReviews_Uncleaned.txt PopReviews_Uncleaned.txt
...
What you'll get is a printout of the 20 most common words in each file, the words those two lists have in common, the most common words unique to each file, and the Automated Readability Index of each file.
...
Questions? More detailed notes on each function's methods are included in the comments in the project's code.