Movie Watchers' Preferences

Introduction

This is an analysis of movie watchers' preferences.

The analysis is based on the dataset MovieLens 25M. It contains 25,000,095 ratings for 62,423 movies, given by 162,541 users between January 09, 1995 and November 21, 2019. This dataset was generated on November 21, 2019 and comes from the MovieLens.org website.

Each entry contains the user ID, the movie, the rating (from 0.5 to 5 stars), and the time when the rating was given (but I found anomalies that suggest that not all times are accurate).
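
As an illustration of the entry format, here is a minimal sketch of reading such entries. It assumes the ratings.csv layout distributed with MovieLens 25M (columns userId, movieId, rating, timestamp, with the timestamp in Unix seconds); the sample rows and the function name read_ratings are made up for the example.

```python
import csv
import io
from datetime import datetime, timezone

# Tiny made-up sample in the assumed ratings.csv layout.
SAMPLE = """userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
2,296,4.0,1141415820
"""

def read_ratings(handle):
    """Yield (user_id, movie_id, rating, time) tuples from a CSV file handle."""
    for row in csv.DictReader(handle):
        yield (
            int(row["userId"]),
            int(row["movieId"]),
            float(row["rating"]),
            # Assumption: timestamps are seconds since the Unix epoch (UTC).
            datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )

ratings = list(read_ratings(io.StringIO(SAMPLE)))
for user_id, movie_id, rating, when in ratings:
    print(user_id, movie_id, rating, when.date())
```

For the real file, the same function could be applied to an open file handle instead of the in-memory sample.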

The goal was to find differences in taste between frequent and infrequent movie watchers, and between people who watch more movies and those who watch fewer. As a proxy, these are measured by the users' ratings frequency and total ratings count, respectively. This dataset offers a window into the preferences of a large group of movie watchers. Frequent users or users with many ratings may correspond to cinephiles or movie buffs, while more casual website users may correspond to casual movie watchers.

In this analysis I tried to find the movies favored by frequent users and users with many ratings (movie buffs) and those favored by casual viewers.

Frequent Users Give Poor Ratings

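There was a highly significant relation between a user’s ratings count (or ratings frequency) and the user's average movie rating: the more a user rates, the lower the average rating tends to be. The relation was even more significant when taking the logarithm of the ratings count or frequency (or perhaps the logarithm of both variables, i.e. log-log).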

For the correlation between the ratings count and the average rating, the p-value is indistinguishable from 0, showing a very clear tendency. The Pearson r value is -0.15 (for the logarithm of the count), a modest effect in absolute terms, though surprisingly strong for such a noisy relation. For Spearman’s rank correlation, the value is -0.13, again with a p-value indistinguishable from 0. For the frequency, the correlation is -0.21 for both the Pearson and the Spearman tests.
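
The two correlation measures can be sketched as follows. This is only an illustration written with the standard library: the per-user numbers are made up, and the helper names are hypothetical, not the code used for the analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up (ratings count, average rating) pairs, one per user.
users = [(20, 4.2), (35, 4.0), (120, 3.8), (400, 3.6), (2500, 3.1), (90, 3.9)]
log_counts = [math.log10(count) for count, _ in users]
averages = [average for _, average in users]

r = pearson(log_counts, averages)
rho = spearman(log_counts, averages)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

Because the Spearman correlation depends only on ranks, it is the same whether or not the logarithm is taken; the Pearson value is not.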

There are no users in the dataset with fewer than 20 ratings (an arbitrary cutoff), distorting the second picture somewhat. For the first picture, when computing frequencies, in order to eliminate certain spurious outliers (such as the 37 people who apparently gave all their ratings at once, in less than a second), I only retained those users whose ratings spanned at least 90 days.
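
The 90-day filter can be sketched as below. The exact definition of the ratings frequency is an assumption here (ratings count divided by the time between a user's first and last rating, expressed per year), and the function name and sample timestamps are made up for illustration.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25
MIN_SPAN_DAYS = 90  # cutoff stated in the text

def user_frequencies(ratings, min_span_days=MIN_SPAN_DAYS):
    """Map user_id -> ratings per year for users with a long enough span.

    `ratings` is an iterable of (user_id, unix_timestamp) pairs.
    Assumption: frequency = ratings count / (last rating - first rating).
    """
    stamps_by_user = defaultdict(list)
    for user_id, timestamp in ratings:
        stamps_by_user[user_id].append(timestamp)

    frequencies = {}
    for user_id, stamps in stamps_by_user.items():
        span_days = (max(stamps) - min(stamps)) / SECONDS_PER_DAY
        if span_days >= min_span_days:
            frequencies[user_id] = len(stamps) / (span_days / DAYS_PER_YEAR)
    return frequencies

YEAR = int(DAYS_PER_YEAR * SECONDS_PER_DAY)
sample = [
    (1, 1000), (1, 1000), (1, 1000),              # all at once: dropped
    (2, 0), (2, YEAR // 3), (2, YEAR // 2), (2, YEAR),  # one-year span: kept
    (3, 0), (3, 30 * SECONDS_PER_DAY),            # 30-day span: dropped
]
frequencies = user_frequencies(sample)
print(frequencies)
```

Users whose ratings all share one timestamp have a span of zero, so the filter also avoids a division by zero for them.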

It looks like the mean of users' average ratings is around 3.7 at the median ratings frequency of roughly 100 ratings/year.

Top Users

The top user had 32,202 ratings. This shows remarkable dedication, which I did not expect.

The top 10 users had the ratings counts shown on the left. Only one user had over 10,000 ratings (actually over 30,000). Out of a total number of 162,541 users, 18 users had over 5,000 ratings and 124 had over 3,000 ratings.

The top user is clearly an outlier.

Preferences: Top 10 Movies

Among the top 10 movies (by the number of ratings) in the database, the only one preferred by users with more ratings over casual users was Pulp Fiction (corr=0.02, p<1E-8). The movies most preferred by casual users, and relatively disliked by users who watch more movies, were Braveheart (corr=-0.08, p<1E-79) and Forrest Gump (corr=-0.07, p<1E-93).

The most significant likes or dislikes were those for Forrest Gump, Braveheart, The Shawshank Redemption, Schindler’s List, and Fight Club (all liked by casual users more than by users with more ratings, with very significant p<1E-20). There was no significant difference of opinion concerning The Matrix and Star Wars: Episode IV.

Preferences: Top 100 Movies

Among the top 100 movies, in addition to the ones mentioned above, users with more ratings showed a strong preference for Batman (1989) (p<1E-46), Raiders of the Lost Ark, and to some extent Groundhog Day, relative to casual users. Another noteworthy such movie is King Kong (1933) (for all, p<1E-4).

Conversely, users with more ratings showed a consistent relative dislike for many more top 100 movies, including Batman Forever, Mrs. Doubtfire, The Rock, Twister, The Fugitive, Independence Day, Pretty Woman, A Beautiful Mind, Ghost, and Good Will Hunting (p<1E-77 for all).

Divisive movies include Gigli (corr=0.09) and Batman (1989) (corr=0.07), which were particularly appreciated by users with more ratings. On the other hand, movies such as The Net (corr=-0.19), Phenomenon (corr=-0.19), Batman Forever (corr=-0.16), The Fugitive (corr=-0.1), or lesser-known ones such as Gladiator (1992) (corr=-0.27) were particularly disliked by users with many ratings, by comparison to casual users.

In this picture, the size of the dots represents the correlation significance (measured by the p-value). The orange density represents the number of movies where they form significant clusters, again weighted by the correlation significance.
Positive correlations are much weaker and fewer. The strength of the negative correlations for movies with more viewers surpasses that of correlations for movies with fewer viewers.

The most controversial movies are certain ones with roughly 1,000 to 10,000 ratings, all of which are relatively disliked by users with more ratings.

Other Findings


Of the 59,047 rated movies in the database, only 24.8% are preferred by users with more ratings.

The proportion becomes even lower when considering movies with at least 10 ratings and for which the p-value for the correlation being nonzero is less than 0.05. Users with more ratings prefer only 2.3% of these movies.

If we look at movies with at least 100 ratings and p<0.01, users with more ratings prefer only 0.36% of those.
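
The filtered proportions above can be computed along these lines. The sketch assumes one (ratings count, correlation, p-value) triple per movie; the sample values are made up and do not come from the dataset.

```python
def share_preferred(stats, min_ratings=1, max_p=None):
    """Fraction of movies with a positive correlation after filtering.

    `stats` holds (ratings_count, correlation, p_value) per movie.
    Movies with fewer than `min_ratings` ratings are dropped, as are
    movies whose p-value is missing or not below `max_p` (when given).
    Returns None if no movie passes the filter.
    """
    kept = [
        (count, corr, p)
        for count, corr, p in stats
        if count >= min_ratings
        and (max_p is None or (p is not None and p < max_p))
    ]
    if not kept:
        return None
    return sum(1 for _, corr, _ in kept if corr > 0) / len(kept)

# Made-up per-movie statistics for illustration only.
stats = [
    (5, 0.40, 0.5),
    (12, -0.20, 0.03),
    (150, -0.10, 0.001),
    (300, 0.05, 0.2),
    (1000, -0.08, 1e-9),
    (40, 0.30, 0.04),
]

print(share_preferred(stats))                             # all movies
print(share_preferred(stats, min_ratings=10, max_p=0.05))
print(share_preferred(stats, min_ratings=100, max_p=0.01))
```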

Method


For each movie in the database with at least two (non-constant) ratings, I computed whether the movie is more appreciated by frequent users or by casual users, and by users with more ratings or by users with fewer ratings.


This is expressed by a correlation between the ratings that users gave the movie and those users’ ratings frequency (or ratings count). The correlation is a number between -1 and 1. The higher the correlation, the stronger the relative preference of frequent users (or users with many ratings) for that movie. A negative correlation means that casual users like the movie more than frequent users do (or that users with fewer ratings like it more than those with more ratings).


I also considered the p-values for the above correlations being nonzero. The p-value is a number between 0 and 1: the probability of observing a correlation at least this strong if the true correlation were zero. A low p-value (p<0.01) means that the correlation is significant. A high p-value means that it is not clear who likes a movie more, users with more ratings or casual users.
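
A minimal sketch of the per-movie computation follows. The text does not say which significance test was used, so the p-value here is an assumption: a two-sided normal approximation based on the Fisher z-transform, chosen because it needs only the standard library. The function names and sample data are hypothetical.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation, or None when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Approximate two-sided p-value for a nonzero correlation.

    Uses the Fisher z-transform (an approximation, not an exact test);
    returns None when there are too few points (n <= 3).
    """
    if n <= 3:
        return None
    if abs(r) >= 1:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def movie_correlations(ratings, user_counts):
    """Map movie_id -> (correlation, p_value).

    `ratings` holds (user_id, movie_id, rating) triples; `user_counts`
    maps user_id -> that user's total ratings count.
    """
    per_movie = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        per_movie[movie_id].append((user_counts[user_id], rating))

    results = {}
    for movie_id, pairs in per_movie.items():
        if len(pairs) < 2:
            continue
        counts, stars = zip(*pairs)
        r = pearson(counts, stars)
        if r is not None:  # skip movies with constant ratings
            results[movie_id] = (r, approx_p_value(r, len(pairs)))
    return results

# Made-up data: five users and three movies.
user_counts = {1: 20, 2: 50, 3: 200, 4: 1000, 5: 5000}
ratings = [
    (1, 10, 5.0), (2, 10, 4.5), (3, 10, 4.0), (4, 10, 3.0), (5, 10, 2.5),
    (1, 20, 4.0), (2, 20, 4.0), (3, 20, 4.0),  # constant ratings: skipped
    (1, 30, 3.0), (5, 30, 4.0),                # two ratings: spurious r = 1
]
results = movie_correlations(ratings, user_counts)
print(results)
```

The third movie shows why correlations of exactly 1 or -1 appear for movies with very few ratings: any two distinct points are perfectly correlated.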

Ratings Tendencies

The histogram of correlations is on the left. It has a non-Gaussian peak, plus some spurious extreme correlations of 1 and -1 and other random correlations due to the small number of ratings for some movies.

These spurious features disappear when filtering for movies with a higher ratings count, while the peak remains.
The center of the peak is clearly negative.