Movie Recommendation Systems with Hyperparameter Tuning Using Grid Search Cross Validation


Background


Streaming and video platforms like YouTube, Netflix, and Disney+ have recommendation systems that suggest relevant movies to their users based on those users' historical interactions. For this case study, I will build and tune the hyperparameters of several different recommendation systems based on users' ratings of films. Given a viewer's prior ratings of a set of films, or those of similar viewers, what movies can we recommend? The data used for this project is a subset of the dataset found here. The recommendation systems built for this project can be applied to other types of items as well.


Objective


I will build the following recommendation systems for the movie ratings dataset:

1. A rank-based recommendation system (based on movie popularity/average ratings)
2. A user-user similarity-based collaborative filtering recommendation system
3. An item-item similarity-based collaborative filtering recommendation system
4. A matrix factorization based collaborative filtering recommendation system (using SVD)

I will also perform some exploratory data analysis on the original data, create a deployable function to output our recommendations of n movies for any user (based on the created algorithms), determine precision and recall @k, and visually represent our predicted values based on our optimized models.


Dataset


The ratings dataset contains the following attributes: userId (the user giving the rating), movieId (the movie being rated), and rating (the rating the user gave the movie).

Importing Libraries

Loading the data

Let's check the info of the data
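A minimal sketch of these three steps, assuming the ratings subset is stored in a file named ratings.csv (the actual path and filename may differ):

```python
import numpy as np
import pandas as pd

# load the ratings subset; the filename is an assumption
ratings = pd.read_csv('ratings.csv')

# column dtypes, non-null counts, and memory footprint
ratings.info()
```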

Exploring the dataset

Distribution of ratings

There are 100,004 ratings in the dataset. The histogram shows the distribution of ratings: 4 and 3 are the most common ratings, with frequencies of nearly 30k and 20k respectively, followed by a rating of 5 with a frequency of about 15k. The ratings are skewed towards 4, 3, and 5 relative to the other values.

User-item interactions matrices

Let us see what a user-item interactions matrix looks like below: it has a cell for each user-movie pair, filled with that user's rating where available. What is immediately apparent is how sparse the matrix is. That is because there are many items, and the majority of users will only ever rate a small share of them, let alone all of them. While it is inefficient to work with an extremely large pandas dataframe that is mostly empty, we will use it to calculate the 'sparsity' of the matrix:
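A minimal sketch of building the interaction matrix and computing how filled it is, assuming the ratings dataframe is called ratings with userId, movieId, and rating columns:

```python
# one row per user, one column per movie; cells hold the rating (NaN if unrated)
interaction_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')

# share of cells that actually contain a rating
filled = interaction_matrix.notna().sum().sum()
total_cells = interaction_matrix.shape[0] * interaction_matrix.shape[1]
print(f"{filled / total_cells:.2%} of the matrix is filled with ratings")
```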

Only 1.64% of our matrix is filled with values! This should illustrate how recommendation systems work with generally sparse matrices.

Unique Users/Movies

Has a movie been rated more than once by the same user? That would imply duplicate records in our data that would need to be cleaned up.
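One way to check this, sketched with pandas (variable names are assumptions):

```python
# count the ratings for each (user, movie) pair; a max of 1 means no user
# rated the same movie twice, and the sum should equal the total rating count
pair_counts = ratings.groupby(['userId', 'movieId']).size()
print(pair_counts.max())   # expect 1
print(pair_counts.sum())   # expect the total number of ratings
```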

The sum equals the total number of observations noted before, meaning each user-movie pair appears only once.

Which is the most interacted movie in the dataset?

The movie with ID 356 has the most interactions in the dataset, with 341 ratings. We also see that there is more than one movie with only 1 rating (seen at the bottom of the previous list).

We see that for movieId 356 (the most interacted-with movie in the set), the most frequent ratings are 4 and then 5, each with a count of over 100. These are followed by a rating of 3 with a frequency of under 60, tapering off for the remaining ratings. This implies that the movie is liked by the majority of users.

Which user interacted the most with any movie in the dataset?

We see that the user with userId 547 interacted the most with movies, giving 2391 ratings.

What is the average number of interactions a user gave for a movie?

Users made approximately 149 interactions on average.

The distribution of the user-movie interactions:

As expected, the distribution shows us that the bulk of users had few interactions, and only a few users had interactions numbering in the hundreds or over a thousand.

Building Our Recommendation Systems

Model 1: Rank-Based Recommendation System (Useful for Cold Start Cases)

A rank-based recommendation system recommends items to users based on each item's popularity. This type of system is useful for dealing with the cold start problem: when a new user joins the system, we cannot recommend movies based on their historical interactions with our dataset (since they have none), so recommendations fall back on a general ranking. Even outside of a cold start situation, users may be interested in how movies are generally ranked by others.

We start by taking the average of all the ratings provided to each movie and then rank them based on their average rating.

We now create a function that recommends the top n movies based on their average ratings. The function also takes a minimum number of ratings required for a movie to be recommended (to exclude cases where a movie was rated 5 stars by only a handful of people).

The function can take in arguments to change n and the minimum number of interactions.
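A minimal sketch of such a function (the names and exact aggregation are assumptions, not the notebook's verbatim code):

```python
def top_n_movies(data, n, min_interactions=100):
    """Recommend the n highest-rated movies with at least min_interactions ratings."""
    # average rating and number of ratings per movie
    stats = data.groupby('movieId')['rating'].agg(['mean', 'count'])
    # drop movies rated by too few users, then rank by average rating
    qualified = stats[stats['count'] >= min_interactions]
    return qualified.sort_values('mean', ascending=False).head(n)

# e.g. the top 10 movies with at least 50 ratings
top_n_movies(ratings, 10, 50)
```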

Using the function to recommend the top 10 movies with 50 minimum interactions based on rating

Top 10 movies with 100 minimum interactions based on rating

Recommending top 10 movies with 300 minimum interactions based on popularity

Note that in the last example, only 7 movies meet the requirement of a minimum of 300 ratings.

Model 2: User-User Based Collaborative Filtering Recommendation System

User-based collaborative filtering is used by many websites for their recommendations. The model predicts the items a user might like based on the ratings given to those items by users with similar tastes. This requires some rating history from a user so they can be grouped with similar users, whose ratings then drive the recommendations.

We can build this kind of system using only user-item interaction data, which may come in the form of ratings (as in this example), likes (e.g. likes on Facebook/YouTube/Instagram/Twitter, or swipes on a dating app), purchase/use (buying a product, or data on it being used), and reading (a book being read by someone), among other possible interactions.

We will build a similarity/neighborhood-based system using K-nearest neighbors (KNN) to find similar users, based on the cosine similarity metric. The surprise library will help us build this and subsequent models.

We have to first load the rating dataset (a pandas dataframe) into a different format used by the surprise library, called surprise.dataset.DatasetAutoFolds. We use the surprise classes Reader and Dataset to accomplish this.

Making the dataset into a surprise dataset and splitting it into train and test sets
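A sketch of that conversion and split; the rating_scale and test_size values are assumptions and should match the actual data:

```python
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split

# rating_scale should match the data's actual min/max rating
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# hold out 20% of the ratings as a test set
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
```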

Making a baseline similarity user-based recommendation system
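A minimal sketch of the baseline model, using KNNBasic with cosine similarity between users:

```python
from surprise import KNNBasic, accuracy

sim_options = {'name': 'cosine', 'user_based': True}  # user-user similarity
knn_user = KNNBasic(sim_options=sim_options)
knn_user.fit(trainset)

# RMSE on the held-out test set
predictions = knn_user.test(testset)
accuracy.rmse(predictions)
```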

The baseline model gives us an RMSE = 0.9901 on the test set.

Let's predict the rating of a user for a particular movie
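For example, for userId=10 and movieId=1240 (r_ui passes the known true rating so it is printed next to the estimate):

```python
# raw ids are used directly; verbose=True prints uid, iid, r_ui, and the estimate
knn_user.predict(10, 1240, r_ui=4, verbose=True)
```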

The actual rating for this user-item pair is 4, and the predicted rating from this similarity-based baseline model is 4.26.

Tuning the hyper-parameters of our KNN user-user similarity based collaborative filter recommendation system

Here are the different hyperparameters of the KNNBasic algorithm: k (the maximum number of neighbors to take into account, default 40), min_k (the minimum number of neighbors required to make a prediction, default 1), and sim_options (a dictionary of similarity options, such as the similarity measure and whether it is user-based or item-based).

Taken from the official documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html
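A sketch of the grid search over k, min_k, and the similarity measure; the candidate values below are placeholders and not necessarily the grid used in the notebook:

```python
from surprise.model_selection import GridSearchCV

param_grid = {
    'k': [10, 20, 30, 40],
    'min_k': [1, 3, 6],
    'sim_options': {'name': ['msd', 'cosine'], 'user_based': [True]},
}
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=4)
gs.fit(data)

print(gs.best_score['rmse'])    # best cross-validated RMSE
print(gs.best_params['rmse'])   # hyperparameter values that achieved it
```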

Once the grid search is complete, we can get the optimal values for each of those hyperparameters, as shown above, which results in a reduced RMSE of 0.9575. This is obtained with k = 20, min_k = 3, and msd (mean squared difference) as the similarity measure.

Let's compare our RMSE and MAE at every split (we had cv=4) to analyze the impact of each value of the hyperparameters we set:

We now build the final model using the optimal hyperparameter values learned from the grid search cross validation above.
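A sketch of the final model with the tuned values (k=20, min_k=3, msd similarity):

```python
optimized_KNN_user = KNNBasic(k=20, min_k=3,
                              sim_options={'name': 'msd', 'user_based': True})
optimized_KNN_user.fit(trainset)
accuracy.rmse(optimized_KNN_user.test(testset))
```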

The above shows that, after tuning the hyperparameters, RMSE has gone down from 0.9901 to 0.9529, a good improvement.

Let's compare our predicted rating for the user with userId=10 on the movie with movieId=1240, which we computed before with the original baseline model, against our optimized recommendation system.

Predicting our rating for a user with userId=10 and for movieId=1240 with the optimized model

Using the baseline KNN model, our predicted rating was 4.26, whereas it is now 4.14, which is closer to the actual rating of 4.

Identifying the most similar users (nearest neighbors) to user with userId=10 based on our user-based collaborative filter
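A sketch of finding those neighbors; get_neighbors works on surprise's inner ids, so the raw userId has to be converted first:

```python
# convert the raw userId to the trainset's inner id
inner_uid = trainset.to_inner_uid(10)

# 20 nearest neighbors according to the fitted user-user similarity matrix
neighbor_inner_ids = optimized_KNN_user.get_neighbors(inner_uid, k=20)

# convert the inner ids back to raw userIds
print([trainset.to_raw_uid(i) for i in neighbor_inner_ids])
```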

Implementing a recommendation system based on our optimized KNNBasic model

We will create a function whose input parameters are: the ratings data, the userId to recommend for, the number of recommendations n, and the trained algorithm to use (a sketch is given below).

We can utilize this function for our future algorithms as well.
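A sketch of such a function under those parameters (the names are assumptions):

```python
def get_recommendations(data, user_id, top_n, algo):
    """Return the top_n (movieId, predicted rating) pairs for movies the user has not rated."""
    # movies the user has already interacted with
    seen = set(data.loc[data['userId'] == user_id, 'movieId'])

    # predict a rating for every movie the user has not rated
    recommendations = []
    for movie_id in data['movieId'].unique():
        if movie_id not in seen:
            est = algo.predict(user_id, movie_id).est
            recommendations.append((movie_id, est))

    # highest predicted ratings first
    recommendations.sort(key=lambda pair: pair[1], reverse=True)
    return recommendations[:top_n]
```

Calling it as get_recommendations(ratings, 10, 20, optimized_KNN_user) would return tuples of (movieId, estimated rating) like those shown later.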

Predict top 20 movies for userId=10 with optimized KNN user-based similarity based recommendation system using our function

Model 3: Item-based Collaborative Filtering Recommendation System

In an item-based collaborative filtering recommendation system, we look for similarities between items based on how users rate them. If a user rated one movie highly, the system looks for items that were rated highly by others who also rated that movie highly. The advantage of such a model in real-world applications is computational: a user-user model must compute similarities between every pair of users, users are typically added at a greater rate than items, and an individual user's profile changes constantly, so those similarities must be recomputed frequently. Item-item models are therefore less computationally expensive, and they also tend to produce more accurate predictions than user-based models in situations where fewer items are rated.
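The baseline item-based model only differs from the user-based one in the sim_options flag; a minimal sketch:

```python
sim_options = {'name': 'cosine', 'user_based': False}  # item-item similarity
knn_item = KNNBasic(sim_options=sim_options)
knn_item.fit(trainset)
accuracy.rmse(knn_item.test(testset))
```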

The RMSE for the baseline item-based collaborative filtering recommendation system is 0.9908, which is almost the same as our user-based model (0.9901).

We now predict the rating for the user with userId=10 on movieId=1240, as we did with the other models before.

As we can see, the actual rating for this user-item pair is 4, and this item-based collaborative filtering system predicts 3.7. Our optimized user-based collaborative filtering system predicted 4.14.

Tuning the hyper-parameters of our KNN item-item similarity based collaborative filter recommendation system

The optimal value for each of our hyperparameters is therefore a k of 50, a min_k of 2, and msd as the similarity measure.

Let's compare our RMSE and MAE at every split (we had cv=4) to analyze the impact of each value of hyperparameters:

After our hyperparameter tuning, RMSE on the test set has improved from 0.9908 (baseline) to 0.9299 (optimized) for the item-based collaborative filtering recommendation system, which is stronger than our optimized user-based model (RMSE: 0.9529).

Predicted ratings for userId=10 on movieId=1240 and movieId=105 using optimized item-based collaborative filtering

For movieId=1240, the estimated rating is 3.82, closer to the actual rating of 4 than our unoptimized item-based prediction of 3.7.

For movieId=105, the baseline model predicted 3.7, whereas the optimized model gives an estimated rating of 3.67.

Identifying the most similar users (nearest neighbors) to a user based on our item-based collaborative filter

Note that these are different from the nearest neighbors produced by the optimized_KNN_user algorithm: [13, 19, 22, 66, 86, 116, 124, 160, 179, 184, 207, 212, 230, 239, 269, 294, 295, 300, 301, 304]

Predicting the top 20 movies for userId=10 with the item-based recommendation system

Compare with those obtained from our user-based recommendation system:

(3038, 5), (309, 4.999999999999999), (6669, 4.881355932203389), (98491, 4.821987480438185), (178, 4.784881983866148), (2920, 4.784530386740332), (1860, 4.7713154312585075), (6776, 4.738562091503268), (4783, 4.733784741814604), (5017, 4.733386572357538), (4263, 4.731378922557885), (26326, 4.723524337675515), (7075, 4.704496788008566), (3414, 4.677535050537985), (1192, 4.662620550158588), (41527, 4.649484536082474), (116, 4.646309855193815), (116897, 4.6424191994394315), (2938, 4.6364787840405315), (766, 4.633780069379941)

Model 4: Collaborative Filtering Based on Matrix Factorization (Factorization Done by SVD)

Matrix factorization breaks down ("factorizes") the original user-item interaction matrix into component matrices that let us assign latent features to both the items and the users, helping us find recommendations for each user. For example, movies may be broken down into latent features corresponding to genres (comedy, romance, thriller, action, etc.), and users may similarly be assigned to latent features over those genres (user1 only likes comedy and action, while user2 only likes comedy and romance, etc.).

There are several ways of factorizing a matrix, including Singular Value Decomposition (SVD), Stochastic Gradient Descent (SGD), and Alternating Least Squares (ALS). For this project, we will use Singular Value Decomposition.

Singular Value Decomposition (SVD)

SVD decomposes a user-item matrix into the following three matrices: a user matrix U (mapping each user to the latent features), a diagonal matrix Σ of singular values (weighting each latent feature), and an item matrix Vᵀ (mapping the latent features to each item).
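Written compactly, in the standard formulation (where m is the number of users, n the number of items, and k the number of latent features retained):

```latex
R_{m \times n} \approx U_{m \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times n}
```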

Build a baseline matrix factorization recommendation system
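A minimal sketch of the baseline SVD model using surprise's defaults (the random_state is an assumption):

```python
from surprise import SVD

# defaults: n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02
svd = SVD(random_state=1)
svd.fit(trainset)
accuracy.rmse(svd.test(testset))
```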

The RMSE for the baseline SVD-based matrix factorization collaborative filtering recommendation system is 0.89, which is the lowest RMSE of all the models looked at so far.

Let us now predict the rating for the user with userId=10 on movieId=1240.

This is very close to the actual value.

Improving our SVD-based matrix factorization based recommendation system by tuning our hyper-parameters:

SVD predicts the unknown ratings by minimizing the regularized squared error, and it achieves this minimization using stochastic gradient descent (SGD), which iteratively updates the user and item factors over the training ratings to reach a minimum of the error. See documentation here for some summary information.

The steps of stochastic gradient descent are performed n_epochs times, which can be set as one of our hyperparameters (default = 20). Two other hyperparameters that we will adjust, from those noted in the documentation above, are lr_all (the learning rate for all parameters, default = 0.005) and reg_all (the regularization term for all parameters, default = 0.02). By adjusting these three hyperparameters, we can see where we get our best results for the SVD-based matrix factorization recommendation system.

Hyperparameter tuning our baseline SVD based matrix factorization collaborative filtering recommendation system and finding the best RMSE
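A sketch of the grid search over those three hyperparameters; the candidate values are placeholders, not necessarily those used in the notebook:

```python
param_grid = {
    'n_epochs': [10, 20, 30],
    'lr_all': [0.001, 0.005, 0.01],
    'reg_all': [0.02, 0.1, 0.4],
}
gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=4)
gs_svd.fit(data)

print(gs_svd.best_score['rmse'])
print(gs_svd.best_params['rmse'])
```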

The optimal values for each of those hyperparameters are shown above. We also see the best RMSE value of all of our models thus far.

Below we analyze our evaluation metrics (RMSE and MAE) at every split to see how the hyperparameters impact our results. Row #32 has our optimal results.
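The randomized search discussed next can be sketched with surprise's RandomizedSearchCV, sampling 20 combinations from the same ranges (values again placeholders):

```python
from surprise.model_selection import RandomizedSearchCV

param_distributions = {
    'n_epochs': [10, 20, 30],
    'lr_all': [0.001, 0.005, 0.01],
    'reg_all': [0.02, 0.1, 0.4],
}
rs_svd = RandomizedSearchCV(SVD, param_distributions, n_iter=20,
                            measures=['rmse', 'mae'], cv=4)
rs_svd.fit(data)
print(rs_svd.best_score['rmse'], rs_svd.best_params['rmse'])
```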

We see that the best hyperparameters selected by this randomized grid search cross validation resulted in almost the same RMSE as our prior manual grid search (the RMSE is very slightly lower here), but it took less time because only 20 hyperparameter combinations were tried.

We will build our final model using the optimal hyperparameters determined by our randomized grid search cross validation.

Predicting the rating for the user with userId=10 on movieId=1240 using our optimized SVD-based matrix factorization collaborative filtering

The algorithm's prediction is very close to the actual rating.

Predict the top 20 movies for userId=10 with our optimized SVD based recommendation system

Compare the above values to the top 20 recommended films by our optimized user-based collaborative filtering model:

(3038, 5), (309, 4.999999999999999), (6669, 4.881355932203389), (98491, 4.821987480438185), (178, 4.784881983866148), (2920, 4.784530386740332), (1860, 4.7713154312585075), (6776, 4.738562091503268), (4783, 4.733784741814604), (5017, 4.733386572357538), (4263, 4.731378922557885), (26326, 4.723524337675515), (7075, 4.704496788008566), (3414, 4.677535050537985), (1192, 4.662620550158588), (41527, 4.649484536082474), (116, 4.646309855193815), (116897, 4.6424191994394315), (2938, 4.6364787840405315), (766, 4.633780069379941)

And compare to the top 20 recommended films by our optimized item-based collaborative filtering model:

(78321, 4.8175582990397805), (3158, 4.686746987951807), (3161, 4.666666666666667), (2801, 4.652173913043478), (2837, 4.594594594594595), (3357, 4.594594594594595), (3207, 4.565217391304349), (4972, 4.565217391304349), (26394, 4.565217391304349), (6268, 4.560975609756098), (1870, 4.5), (2388, 4.5), (3790, 4.5), (30883, 4.5), (43177, 4.449993480245142), (6598, 4.433333333333333), (8199, 4.414452709883103), (760, 4.411764705882352), (4568, 4.409090909090908), (6506, 4.409090909090908)

Comparing predicted to actual ratings

We define a function below that locates all of the actual ratings a user has made and places them next to the predicted ratings in a pandas dataframe. This will help us explore the accuracy of the predictions from the three optimized models for any user. We will visualize the predicted and actual ratings with distribution plots.
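A sketch of that function and the accompanying plot; seaborn/matplotlib and the variable names are assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def predicted_vs_actual(data, user_id, algo):
    """Actual vs. predicted ratings for every movie the user has rated."""
    user_df = data.loc[data['userId'] == user_id, ['movieId', 'rating']].copy()
    user_df['predicted'] = user_df['movieId'].apply(
        lambda movie_id: algo.predict(user_id, movie_id).est)
    return user_df

comparison = predicted_vs_actual(ratings, 10, optimized_KNN_user)

# overlaid histograms with kernel density estimates
sns.histplot(comparison['rating'], kde=True, label='actual')
sns.histplot(comparison['predicted'], kde=True, color='orange', label='predicted')
plt.legend()
plt.show()
```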

1. Let's explore the actual and predicted ratings for a user based on the user-based similarity recommendation system:
2. Now let's explore the actual and predicted ratings for the same user based on the item-based similarity recommendation system:
3. And finally, let's explore the actual and predicted ratings for the same user based on the SVD matrix factorization based recommendation system:

Analysis: We can see that the distribution of predicted ratings in each of these models generally follows the distribution of this user's actual ratings. The kernel density estimate overlaid on each histogram similarly shows the density of predicted ratings paralleling the actual ratings, but with greater mass in the central region around ratings of 3-4. Of the three models, the SVD-based one predicts ratings across the whole spectrum, including values close to 1 and 5 just as the actual ratings do, whereas the user- and item-based collaborative filtering systems predict values closer to the center, avoiding the extremes near 1 and 5.

Precision@k and Recall@k

While we utilized RMSE above to measure the accuracy of our models, two other popular metrics for evaluating recommendation systems include Precision@k and Recall@k.

After defining a 'threshold' for what counts as a recommendation for a user, e.g. anything above 3.5 stars (as we will use below), we look at the top n recommendations, defined by k (e.g. k=5 or k=10). A relevant item is any item whose actual rating by the user is above our threshold of 3.5, while a recommended item is an item our algorithm predicted would be above that 3.5 threshold for that user (whether or not the user actually rated it that highly).

Precision@k looks at the top k recommendations for a user (i.e. the top k number of items above our rating threshold), and determines what proportion is relevant to the user. It is calculated in the following way:

{# of recommended items @k that are relevant, i.e. actually over the threshold} / {# of recommended items @k}

In other words, Precision@k tries to determine the proportion of the top k recommendations that are relevant. This metric is helpful in making sure we minimize our model's false positives.

Recall@k looks at the number of recommended items @k that were in fact relevant (i.e. the user actually did rate them over our threshold) and finds what proportion those correct recommendations make up of all relevant items.

{# of recommended items @k that are relevant} / {total # of relevant items}

In other words, Recall@k tries to see how many of the relevant items actually end up in the top k recommendations. This is helpful if we want to make sure that movies which are in fact relevant actually appear in our top k recommendations and do not end up as false negatives.

The following function is taken from the surprise documentation FAQ here; it lets us compute precision@k and recall@k. It creates dictionaries for precision and recall and assigns the computed values to each user based on the formulas above.
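The sketch below closely follows that FAQ function; the threshold and k defaults match the values discussed above:

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision@k and recall@k, computed per user."""
    # group the (estimated, true) rating pairs by user
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = dict(), dict()
    for uid, user_ratings in user_est_true.items():
        # sort this user's ratings by estimated value, best first
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # relevant = true rating at or above the threshold
        n_rel = sum(true_r >= threshold for (_, true_r) in user_ratings)
        # recommended@k = predicted at or above the threshold within the top k
        n_rec_k = sum(est >= threshold for (est, _) in user_ratings[:k])
        # both relevant and recommended within the top k
        n_rel_and_rec_k = sum((true_r >= threshold) and (est >= threshold)
                              for (est, true_r) in user_ratings[:k])

        # undefined cases (no recommendations / no relevant items) are set to 0 here
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls
```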

Summarizing the above outputs for each model along with our earlier RMSE scores:

knn_user (RMSE: 0.9901):

optimized_KNN_user (RMSE: 0.9529):

knn_item (RMSE: 0.9908):

optimized_KNN_item (RMSE: 0.9296):

svd (RMSE: 0.8948):

svd_optimized (RMSE: 0.8762):

Comments:

We see that no single model was best on all metrics when we compared RMSE, precision, and recall at k=5 and k=10. Comparing just our three optimized models, however, the svd_optimized matrix factorization model did the best overall, with the lowest RMSE and the highest precision @k=5 and @k=10, though its recall was not the best of all models (it was still strong).

Our optimized_KNN_item did better than our optimized_KNN_user in terms of RMSE, but precision and recall @k=5 and @k=10 are better for optimized_KNN_user, which had the strongest recall of all the models.

While our RMSE scores tell us that svd_optimized has the lowest error across all of its predictions, optimized_KNN_user is the strongest when it comes to recall. Which of these metrics matters most should be weighed against practical considerations when the system is deployed.

Conclusion

In this case study, we saw three different ways of building recommendation systems:

1. Rank-based (using movie popularity/average ratings)
2. Similarity-based collaborative filtering (both user-user and item-item)
3. Matrix factorization based collaborative filtering (using SVD)

Additionally, we utilized manual and randomized grid search cross validation to tune our hyperparameters and reduce the RMSE of our similarity-based and matrix factorization collaborative filtering models.

We also used our models to recommend the top items for any user, and evaluated the precision and recall of our models @k.

As was noted, there are advantages and disadvantages of these various recommendation systems, which may vary depending on the needs of the company deploying the recommender.