04 Joining Datasets using R

For illustration we will use movielens dataset.

Please download small dataset, ~ 1MB in size, from http://grouplens.org/datasets/movielens/.

Look at readme.html for data description.

Unzip the file and set your R working directory to the directory of unzipped files.

> setwd("/data/movie_lens/ml-latest-small")

We want to display name and average rating for each movie.

Load movies and ratings into 2 data frames.

> movies = read.csv("movies.csv")

> ratings = read.csv("ratings.csv")

View the structure

> str(movies)

'data.frame': 10329 obs. of 3 variables:

$ movieId: int 1 2 3 4 5 6 7 8 9 10 ...

$ title : Factor w/ 10327 levels "¡Three Amigos! (1986)",..: 9391 4930 3876 9797 3148 4072 7852 9335 8741 3723 ...

$ genres : Factor w/ 938 levels "(no genres listed)",..: 338 396 698 651 601 245 698 380 2 123 ...

This dataset has 10329 rows.

> str(ratings)

'data.frame': 105339 obs. of 4 variables:

$ userId : int 1 1 1 1 1 1 1 1 1 1 ...

$ movieId : int 16 24 32 47 50 110 150 161 165 204 ...

$ rating : num 4 1.5 4 4 4 4 3 4 3 0.5 ...

$ timestamp: int 1217897793 1217895807 1217896246 1217896556 1217896523 1217896150 1217895940 1217897864 1217897135 1217895786 ...

This dataset has 105,339 rows.

Find whether is any duplicate movieId in movies dataset.

> movieId.freq = as.data.frame(table(movies$movieId))

> any(movieId.freq$Freq > 1)

[1] FALSE

It shows that there is no duplicate in movieId field.

Intuition is there is going to duplicate for movieId in ratings table as it is a fact table.

Find out average count of ratings for movies.

> ratings.movieId = table(ratings$movieId)

> mean(ratings.movieId) # should be 10.20

Find all movies with "Rocky" in title

> movies[grep("rocky", movies$title, ignore.case = TRUE), ]

movieId title genres

1521 1954 Rocky (1976) Drama

1919 2409 Rocky II (1979) Action|Drama

1920 2410 Rocky III (1982) Action|Drama

1921 2411 Rocky IV (1985) Action|Drama

1922 2412 Rocky V (1990) Action|Drama

2122 2657 Rocky Horror Picture Show, The (1975) Comedy|Horror|Musical|Sci-Fi

2968 3754 Adventures of Rocky and Bullwinkle, The (2000) Adventure|Animation|Children|Comedy|Fantasy

7222 49651 Rocky Balboa (2006) Action|Drama

9089 89045 Rocky VI (1986) Comedy

How many ratings are there for "Rocky IV" that is with movieId = 2411.

> mean(ratings[ratings$movieId == "2411", ]$rating)

Now, let's join the movie.

> ratings.merged = merge(ratings, movies, by = "movieId")

> head(ratings.merged)

movieId userId rating timestamp title genres

1 1 103 3.0 978554606 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

2 1 328 5.0 1237950888 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

3 1 453 4.0 859302647 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

4 1 590 3.5 1243424645 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

5 1 419 4.0 1447594754 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

6 1 180 3.5 1180846101 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

To test the joined data, let's find the average rating from joined data. It should match with above.

> mean(ratings.merged$rating[ratings.merged$movieId == "2411"])

More efficient option could have been find average rating from ratings table for each movieid then join.

Find top 6 movies based on highest average rating. Consider those movies that have received at least 100 ratings in total.

> require(plyr)

> movies = read.csv("data/ml-latest-small/movies.csv")

> ratings = read.csv("data/ml-latest-small/ratings.csv")

> df = merge(x = movies, y = ratings, by.x = "movieId", by.y = "movieId")

> df = ddply(df, c("movieId", "title"), function(x) c(avgRating = mean(x$rating), count = nrow(x)))

> df = subset(df, count > 100)

> df = arrange(df, desc(avgRating))

> head(df)

movieId title avgRating count

1 858 Godfather, The (1972) 4.487500 200

2 318 Shawshank Redemption, The (1994) 4.487138 311

3 1221 Godfather: Part II, The (1974) 4.385185 135

4 50 Usual Suspects, The (1995) 4.370647 201

5 527 Schindler's List (1993) 4.303279 244

6 1193 One Flew Over the Cuckoo's Nest (1975) 4.256944 144