All of the data we needed came from IMDB. Since our project website relied heavily on attributes such as movie genres, runtime, and ratings, being able to pull those data points from such a large site helped us make crisp, interactive visualizations. The data can be downloaded from this link: ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/
While this database has numerous files of movie information, we took only six: one for the movies themselves and others covering when each movie was released, what its rating was, how long it ran, what genre it fell under, and which keywords are associated with it. All of the downloaded files had to be extracted before they could be opened.
When we first looked at the files, we realized we had some cleaning up to do: removing bad rows and filtering out data that didn't describe movies. First we created a new directory to store the files we needed, along with new files for the movie names, release dates, genres, certificates (ratings), and keywords respectively. The downloaded files were initially binary, so we converted them to CSVs to make them easier to read and pull data from.
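As a rough sketch of that conversion step, assuming the extracted files can be read as text and parsed line by line (the file names, parsing rule, and column labels below are illustrative, not our exact ones):

```python
import csv
import os

RAW_DIR = "raw"   # where the extracted list files live (illustrative path)
CSV_DIR = "csv"   # new directory for the converted CSV files
os.makedirs(CSV_DIR, exist_ok=True)

def list_to_csv(list_name, csv_name, columns, parse_line):
    """Convert one extracted list file into a CSV with the given columns.

    parse_line turns a raw text line into a list of field values,
    or returns None for lines that should be skipped.
    """
    with open(os.path.join(RAW_DIR, list_name), encoding="latin-1") as src, \
         open(os.path.join(CSV_DIR, csv_name), "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)
        for line in src:
            fields = parse_line(line)
            if fields is not None:
                writer.writerow(fields)

# Example: a very loose parser for a "title<TAB>year" style line.
def parse_movie_line(line):
    parts = [p for p in line.rstrip("\n").split("\t") if p]
    return parts if len(parts) == 2 else None

list_to_csv("movies.list", "movies.csv", ["movie", "year"], parse_movie_line)
```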
The script where we converted the downloaded files to CSVs is also where we did the majority of our data filtering. For every file, we removed rows that didn't specify a year. We also dropped rows where the type of media was TV, video, Blu-ray, or a video game; since we were only looking for data on movies, the other formats were not needed.
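A minimal pandas sketch of that filtering pass, under the assumption that the converted CSV has `year` and `media_type` columns (both names are hypothetical):

```python
import pandas as pd

movies = pd.read_csv("csv/movies.csv")  # hypothetical converted file

# Drop rows that don't specify a concrete release year.
movies["year"] = pd.to_numeric(movies["year"], errors="coerce")
movies = movies.dropna(subset=["year"])
movies["year"] = movies["year"].astype(int)

# Keep only theatrical movies: drop TV, straight-to-video, Blu-ray, and video game entries.
non_movie_types = {"tv", "video", "blu-ray", "video game"}
movies = movies[~movies["media_type"].str.lower().isin(non_movie_types)]
```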
For the certificates file specifically, we removed rows with a rating of "X" or "NC-17", along with ratings that weren't from the United States. For genres, we excluded rows with genres that belong to TV, such as Reality-TV or News. Also, since there were numerous genres when we first loaded the file, we removed any genre with fewer than 100 entries, because the rare ones would start to clutter the visualizations; we stuck with the main genres that occurred most often.
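In pandas terms, these two cleanups might look roughly like this (file names, column names, and the exact list of excluded TV genres are all assumptions for illustration):

```python
import pandas as pd

certificates = pd.read_csv("csv/certificates.csv")  # hypothetical: movie, country, certificate
genres = pd.read_csv("csv/genres.csv")              # hypothetical: movie, genre

# Keep U.S. ratings only, and drop X / NC-17 titles.
certificates = certificates[certificates["country"] == "USA"]
certificates = certificates[~certificates["certificate"].isin(["X", "NC-17"])]

# Drop TV-oriented genres (the real exclusion list may have been longer).
genres = genres[~genres["genre"].isin({"Reality-TV", "News"})]

# Keep only genres that appear at least 100 times, so rare labels
# don't clutter the visualizations.
counts = genres["genre"].value_counts()
genres = genres[genres["genre"].isin(counts[counts >= 100].index)]
```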
The same goes for the keywords: we set a minimum for how many times a keyword had to appear in the data to be included, and removed keywords used fewer than 20 times. Finally, for the running times, since most movies are clearly longer than the average TV episode, we filtered out entries with a runtime of less than an hour. That way we didn't have to deal with the occasional short film that is far shorter than the rest.
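The keyword and runtime filters follow the same pattern; again, file and column names here are placeholders:

```python
import pandas as pd

keywords = pd.read_csv("csv/keywords.csv")        # hypothetical: movie, keyword
runtimes = pd.read_csv("csv/running_times.csv")   # hypothetical: movie, runtime_minutes

# Keep only keywords that are used at least 20 times across the data set.
counts = keywords["keyword"].value_counts()
keywords = keywords[keywords["keyword"].isin(counts[counts >= 20].index)]

# Drop anything shorter than an hour; those are unlikely to be feature films.
runtimes = runtimes[runtimes["runtime_minutes"] >= 60]
```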
Each table we output included the movie name alongside the specific attribute of the movie we were going to look at. For example, the genres and runtime tables looked like this:
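In pandas, peeking at those two per-attribute tables would look roughly like this (file and column names are the hypothetical ones used in the sketches above):

```python
import pandas as pd

genres = pd.read_csv("csv/genres.csv")            # columns (hypothetical): movie, genre
runtimes = pd.read_csv("csv/running_times.csv")   # columns (hypothetical): movie, runtime_minutes

print(genres.head())    # movie name alongside its genre
print(runtimes.head())  # movie name alongside its runtime in minutes
```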
After the filtering, we combined these files into one large data table. Aside from the keywords, everything went into this table: the movie name, genre, certificate, and runtime, along with the year and month the movie was released.
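A sketch of that combining step, assuming each cleaned CSV shares a movie-name column to merge on (all file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical cleaned files; the real merge keys in our scripts may differ.
movies = pd.read_csv("csv/movies.csv")              # movie, year, month
genres = pd.read_csv("csv/genres.csv")              # movie, genre
certificates = pd.read_csv("csv/certificates.csv")  # movie, certificate
runtimes = pd.read_csv("csv/running_times.csv")     # movie, runtime_minutes

combined = (
    movies
    .merge(genres, on="movie")
    .merge(certificates, on="movie")
    .merge(runtimes, on="movie")
)
combined.to_csv("csv/combined.csv", index=False)
```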
This made it easier to access the data and use it to build the visualizations (plots/graphs) for our project. With the data cleaned up, we could easily make tables for different aspects of the data, such as individual tables for year, genre, and certificate. We also wrote functions to order the data in specific ways so it could be displayed correctly in our visualizations; for example, functions to count how many movies were released in each month, to compute the average number of films released per year/decade, and to compute the distribution of genres across the entire data set.
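Sketches of a few such helper functions, written against the hypothetical combined table from the sketch above:

```python
import pandas as pd

combined = pd.read_csv("csv/combined.csv")  # hypothetical combined table

def movies_per_month(df):
    """Count how many distinct movies were released in each calendar month."""
    return df.groupby("month")["movie"].nunique()

def average_movies_per_year(df):
    """Average number of films released per year across the data set."""
    return df.groupby("year")["movie"].nunique().mean()

def genre_distribution(df):
    """Share of each genre across the entire data set."""
    return df["genre"].value_counts(normalize=True)
```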
Below are screenshots of the combined data table we created, which stores everything in one place and makes it easy to access specific parts of the data, along with the output showing how many movies were released per decade.
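For reference, a per-decade count like the one in that output could be produced with something along these lines (again using the hypothetical combined table):

```python
import pandas as pd

combined = pd.read_csv("csv/combined.csv")  # hypothetical combined table

# Bucket each release year into its decade and count distinct movies per bucket.
combined["decade"] = (combined["year"].astype(int) // 10) * 10
print(combined.groupby("decade")["movie"].nunique().sort_index())
```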