An IMDb dataset has been downloaded in the form of multiple CSV files with each file containing movie information on an yearly basis. The files include the following columns: Title, Movie Link, Year, Duration, MPA, Rating, Votes, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date.
For the purpose of this project the last 30 years have been taken into consideration, from 1995 to 2024. The individual data files have been merged together into one which yields just under 16k records.
CSV file of year 1995
Combined CSV file of 30 years
TMDb stands for "The Movie Database". It is a tool which provides data on movies, actors, and other detailed information that can be accessed by developers for free. Missing values in the Kaggle Dataset are filled using this API to ensure maximum retention of information.
There are various endpoints that can be used to access the data. Below are two sample endpoints with their output:
Top rated movies - https://api.themoviedb.org/3/movie/top_rated?api_key="unique api key"
Popularity in descending order for the year 2023 - https://api.themoviedb.org/3/discover/movie?api_key="unique api key"&primary_release_year=2023&sort_by=popularity.desc
IMDb stands for "Internet Movie Database" and is a popular website to find information on particular movies or TV shows. Includes cast and crew lists, release dates, plot over view, trivia and more.
This website was just used for reference.
Step 1: Understanding the data and its deficiencies
The dataset is loaded into a dataframe using the Pandas library. Then the data is thoroughly looked at and analyzed to understand all the information present - what are the different columns, what information do they relay, what are their formats and so on.
In the snippet of one of the records we can see that there is a mix of qualitative data (Title, Movie Link) and qunatitative data (Budget, Rating).
Further analysis shows us that there are a total of 23 columns and 15850 records. As shown on the left the different data types of each column is also visible, this includes object, int and float types. Additionally, for each column the number of values which aren't null is also given. For example, out of 15850 values only 13498 values are given for the "MPA" column.
Columns which have missing values are - DURATION, MPA, RATING, VOTES, BUDGET, GROSSWORLDWIDE, GROSS_US_CANADA, OPENING_WEEKEND_GROSS, RELEASE DATE
For numerical columns boxplots have been generated to visually understand the values. The dots indicate any outliers present. As shown below 7 values are visually seen for each boxplot. For the case of BUDGET, the maximum value is 300B (Billion) , the upper fence of the box plot is 93M (Million), any values beyond this is an outlier. The third quartile (q3) is 40M, followed by the median at 15M and the first quartile (q1) at 4.144592M. The lower fence of the boxplot is at 1 and so is the minimum value.
The graphs below demonstrate the number of missing values in each of the columns. The heatmap shows, as seen by the yellow lines BUDGET has a lot of missing values. The numbers on the left of the heatmap correspond to the row numbers. RELEASE_DATE has some missing values. This information is represented in a new way by the graph on the right. It gives the count of missing values of each column, like for BUDGET the number of missing values is over 8000, that is followed by OPENING_WEEKEND_GROSS at a little over 3000 values.
The number of movies released per year can be seen on the left graph. The curve shows a constant number of movies being released with the minimum being 500. There are some spikes in the years like 2020, 2022, 2023 and 2024. The maximum number of movies are at around 600.
The graph on the right shows the top 10 movies with the most number of nominations. Barbie movie has over 400 nominations, followed by Killers of the Flower Moon. The tenth movie is 12 Years a Slave. The numbers like 1. Barbie and 34. Killers of the Flower Moon is since the data isn't cleaned yet.
The most common MPA Rating, as seen on the left graph, is 'R' at over 5000 movies being rated with it. That is followed by 'PG-13'. '16+' rating and such are all negligible.
The scatterplot on the right shows the correlation between BUDGET and GROSSWORLDWIDE columns. Most movies have a lower budget, as seen by a higher concentration of pink dots, but have a varying gross world wide earnings.
Step 2: Data Cleaning & Preparation
Now the issues which were identified in the previous step are addressed one by one. Starting with removing the number in the TITLE column of the movies.
Code:
def clean_title(title):
return re.sub(r'^\d+\.\s*', '', str(title))
merged_df["TITLE"] = merged_df["TITLE"].apply(clean_title)
This is followed by dropping columns which don't give substantial information. Columns like MOVIE LINK doesn't help in analysis. As for GROSS_US_CANADA and OPENING_WEEKEND_GROSS don't provide any information which isn't already given by GROSSWORLDWIDE. Region analysis could be done but these columns have considerable missing values. RELEASE_DATE only contained the year of release which is already given by the YEAR column. Lastly, WINS column had all 0 values.
Code:
merged_df.drop(['MOVIE LINK','GROSS_US_CANADA', 'OPENING_WEEKEND_GROSS','RELEASE_DATE','WINS'], axis=1, inplace=True)
The data on the left shows the dataframe after cleaning the title and dropping the columns. Only 18 columns are left out of 23 columns.
Next, the TMDb API is used to fill in the missing values for DURATION, RATING, VOTES and GROSSWORLDWIDE columns. The movie TITLE is used for the query to fetch the movie_id which is then used to extract the information of the movies. Post this the number of missing values reduce considerably. The following is the count of missing values.
Code:
# Function to search for a movie by title
def search_movie(title):
search_url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={title}"
response = requests.get(search_url)
if response.status_code == 200:
results = response.json().get('results')
if results:
return results[0]['id'] # Return the first matching movie's ID
return None
# Function to get movie details by movie_id
def get_movie_details(movie_id):
details_url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
response = requests.get(details_url)
if response.status_code == 200:
return response.json()
return None
{'DURATION': 8, 'MPA': 2352, 'RATING': 15, 'VOTES': 15, 'BUDGET': 6888, 'GROSSWORLDWIDE': 24}
For the remaining numerical columns the null values are replaced with -1.
Code:
merged_df['GROSSWORLDWIDE'] = merged_df['GROSSWORLDWIDE'].fillna(-1)
merged_df['BUDGET'] = merged_df['BUDGET'].fillna(-1)
merged_df['DURATION'] = merged_df['DURATION'].fillna(-1)
merged_df['RATING'] = merged_df['RATING'].fillna(-1)
merged_df['VOTES'] = merged_df['VOTES'].fillna(-1)
MPA is the only quantiative column and its null values are replaced with 'Not Rated'.
Code:
merged_df['MPA'] = merged_df['MPA'].fillna('Not Rated')
merged_df['MPA'] = merged_df['MPA'].replace("Unrated", "Not Rated")
Next the VOTES column is standardized and converted into quantitative values. For example, initally the values are in the format like 907K and 1.8M which are object type data. These need to be converted to their numerical equivalent, 907K = 907000.
Code:
def convert_votes(vote):
if pd.isna(vote):
return np.nan
vote = str(vote).replace(',', '').strip()
vote = vote.upper().replace(',', '').strip()
if vote.endswith('K'):
return float(vote[:-1]) * 1_000
elif vote.endswith('M'):
return float(vote[:-1]) * 1_000_000
else:
return float(vote)
merged_df['VOTES'] = merged_df['VOTES'].apply(convert_votes)
Similarly the DURATION column has values in the format '2h 22m' this is standardized and converted into only minutes. Making 2h 22m into 142 minutes.
Code:
def convert_duration(duration):
if pd.isna(duration):
return None
duration = str(duration).strip().lower()
hours = 0
minutes = 0
# Extract hours and minutes using regex
hour_match = re.search(r'(\d+)h', duration)
minute_match = re.search(r'(\d+)m', duration)
if hour_match:
hours = int(hour_match.group(1)) * 60 # Convert hours to minutes
if minute_match:
minutes = int(minute_match.group(1)) # Keep minutes as is
return hours + minutes # Total duration in minutes
merged_df['DURATION'] = merged_df['DURATION'].apply(convert_duration)
Lastly, the columns which are in the format of a string with multiple comma separated values are converted into an actual list. Quotes around any value are also removed.
Code:
def convert_sort_and_clean_list(value_string):
if pd.isna(value_string): # Handle missing values
return []
if isinstance(value_string, list): # If it's already a list, sort and return
return [str(item).strip("'").strip('"') for item in sorted(value_string)]
try:
return [str(item).strip("'").strip('"') for item in sorted(ast.literal_eval(value_string))] # Convert string to list safely, sort, and remove quotes
except (SyntaxError, ValueError):
return [] # Return empty list if conversion fails
# Apply the function to multiple columns in merged_df
columns_to_sort = ['STARS', 'DIRECTORS', 'GENRES', 'COUNTRIES_ORIGIN', 'FILMING_LOCATIONS', 'PRODUCTION_COMPANIES', 'LANGUAGES','WRITERS']
for col in columns_to_sort:
merged_df[col] = merged_df[col].apply(convert_sort_and_clean_list)
Snippet of dataframe post cleaning
Data type of each column post cleaning. Moreover, all the columns have a Non-Null Count of 15850 which is the total number of records, indicating that all null values have been dealt with.
Step 3 : Visualizations post cleaning
The graph on the left shows the relation between number of oscars won by a movie and the number of nominations it received. Visually no relation seems to be present between these parameters.
The graph on the right shows the relation between number of nominations and the rating of the movie. There is an increasing trend between the two. As the ratings go higher, beyond 6, the number of nominations also increase. Low rated movies have fewer nominations.
The graph on the left shows the top 5 highest rated movies by the audience. "Bump Kowlski and the Ten Commandments" and "The Bolshoi Ballet: Live from Moscow - The Nutcracker" are the highest rated movies at 10.
The top 5 movies which have won the most number of Oscars are shown on the right. "Gangs of Ne York", "True Grit", "American Hustle" and "Killers of the Flower Moon" all have 10 Oscars each.
The 10 most popular star based on the highest number of movies they have appeared in is shown by the graph on the left. Samuel L. Jackson has featured in the most number of movies, around 60, from 1995 to 2024. Nicole Kidman comes at the 10th position with around 45 movies.
Luc Besson is the most popular writer. He has been the writer for around 34 movies. William Shakespeare comes second at 30 movies, showing the relevance of this plays and stories in today's time.
The most popular genre is Drama with over 9000 movies belonging to this genre. That is followed by Comedy, Thriller then Romance. Horror comes at the 10th position.
The most popular director has been Steven Soderbergh, he has directed under 30 films over 30 years. Spike Lee comes at the 10th position with over 15 movies.
The most frequent MPA ratings for the top 5 genres (Action, Comedy, Drame, Romance and Thriller) are shown in the graph on the left. Only the top 3 ratings, 'R', 'PG' and 'PG-13' are taken into consideration. Across all the genres, most movies are rated 'R', followed by 'PG-13' and then 'PG'.
The most popular language for the movies is English with over 10 thousand movies being in English. That is followed by French, which has a dramatically low number of movies around 2 thousand movies.