On May 25, 2020
A recommender system predicts which products to suggest to a user, products the user might never have found otherwise, by finding similarities between users and products. Ever wondered how products like YouTube, Spotify, Amazon, Netflix etc keep finding things to show you that almost always match your interests? The answer lies in recsys! We’ll be looking at a high-level explanation of how Netflix recommends movies to its users.
In 2006, Netflix organized a competition with a cash prize of $1 million for whoever could build a recommender system better than the one they had at the time for predicting which movies to recommend to their millions of users. The basic ideas behind the winning solution are what will be discussed in this post.
A user-movie matrix in a Netflix movie recommender setting is a gigantic matrix that holds the rating every user in the database gave every movie. A sample user-movie matrix for 5 users and 4 movies, with each user’s rating for each movie, is shown below:
The matrix above was built by querying the database and selecting the users who rated the most movies and the movies that were rated the most, just to simplify things a bit for clarity. In reality, most users don’t rate most movies! One thing most of us don’t realise is that Netflix actually keeps a really, really huge matrix of every user’s rating for every movie in its system. Just to put this in context, Netflix has hundreds of millions of users and a huge catalogue of movies, so imagine how big that matrix is. One thing they do is store the matrix in an optimized form so that its size shrinks dramatically. The idea behind this optimization is introduced in the next section.
Now that we have a matrix of the rating every user gives each movie, what do we do with it? One thing we can do to make more sense of this matrix is to factorize it. If you remember elementary maths, factorization just means breaking a big number like 64 into its factors: 2 and 32, or 8 and 8, etc.
For matrices, factorization involves breaking the matrix into n matrices such that when those n matrices are multiplied together with the matrix product, we obtain the original matrix. For recsys, n equals 2, so we have 2 matrix factors for the original user-movie matrix. One of the factors is the user-factor matrix and the other is called the movie-factor matrix. How these user and movie factors are obtained will be discussed in the next section. The general idea is that the 2 factors contain the rating information that depends on the users and on the movies respectively, such that when they are combined, the rating is recovered.
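As a concrete sketch (with made-up numbers, and NumPy rather than anything Netflix actually ran), multiplying a user-factor matrix by a movie-factor matrix rebuilds a full ratings matrix:

```python
import numpy as np

# Hypothetical factors for 5 users and 4 movies, with 2 latent factors each.
user_factors = np.array([
    [0.9, 0.1],   # user 0: very receptive to factor 1, barely to factor 2
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.1, 0.9],
])                                  # shape: (5 users, 2 factors)

movie_factors = np.array([
    [4.0, 1.0, 3.0, 0.5],
    [0.5, 4.5, 2.0, 4.0],
])                                  # shape: (2 factors, 4 movies)

# Multiplying the two factor matrices back together gives the
# full 5 x 4 user-movie rating matrix.
ratings = user_factors @ movie_factors
print(ratings.shape)                # (5, 4)
print(ratings[0, 0])                # 0.9*4.0 + 0.1*0.5 = 3.65
```

Notice the compression: the two factors hold 5×2 + 2×4 = 18 numbers instead of the 5×4 = 20 in the full matrix, and the gap grows enormously at Netflix scale.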
Embeddings are the learnable parameters/numbers which make up the user-factor matrix and the movie-factor matrix.
NB: The height of the user-factor matrix must match the number of users in the user-movie rating matrix, and likewise the width of the movie-factor matrix must match the number of movies.
A sample of a factorized user-movie rating matrix with its embeddings is shown below. We are only considering two factors for each of the user and movie-factor matrices. For the movie-factor matrix, the factors are the amount of comedy and of action in each movie; for the user-factor matrix, they are how receptive the user is to comedy and to action. All of this can be seen in the linked image:
As we can see from the image, each user and movie has its own 2d embedding such that when a dot product is taken between them, we obtain the rating for that user-movie combination.
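For a single user-movie pair, that dot product is one line of code. A tiny sketch with made-up comedy/action numbers:

```python
import numpy as np

# Hypothetical 2d embeddings: [comedy, action].
user = np.array([0.9, 0.2])    # very receptive to comedy, not to action
movie = np.array([4.5, 0.5])   # mostly a comedy

rating = user @ movie          # 0.9*4.5 + 0.2*0.5
print(rating)                  # 4.15
```

A comedy-loving user meeting a comedy movie yields a high predicted rating; swap in an action movie like `[0.5, 4.5]` and the prediction drops.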
Embeddings are important because they let us represent each user’s features (i.e. receptiveness to comedy and action) as numbers the computer can easily make sense of, and likewise for movies. These embeddings are what enable the computer to learn similarities and relationships between all the movies and all the users that we may not even know exist. The features captured by these embeddings are called *latent factors/features*. The idea of embeddings is revolutionary and it is one of the most important ideas in training neural networks.
Time to address the elephant in the room that has been ignored so far: how do we get numbers that properly represent the features of the users and movies such that if we take their dot product, we get the original user-movie rating matrix, or something close to it? Believe it or not, the answer is to initialize all the numbers in the 2 embedding matrices (factor matrices) with random values and keep adjusting them, by some means we will discuss later, until we get values that produce the desired user-movie ratings. This simple idea was one of the many key things that earned the winner of the Netflix Prize $1M (now pause and ponder about your life and the choices you’ve made).
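A minimal sketch of that starting point (the shapes and the noise scale here are arbitrary choices, not anything prescribed):

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_movies, n_factors = 5, 4, 2

# The embeddings start as pure noise; training is what turns
# them into something meaningful.
user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_factors = rng.normal(scale=0.1, size=(n_factors, n_movies))

print(user_factors.shape, movie_factors.shape)  # (5, 2) (2, 4)
```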
NB: When working with embeddings, the computer does not know whether the features it is trying to learn have anything to do with comedy or action. Its job is just to find the optimal values that give back the rating matrix. However, upon close examination of the embedding matrices that are learned for the users and movies, we will find that users with similar taste in movies end up close to each other in the embedding space. The same also applies to the movies.
In order to improve the randomly initialized embeddings, we have to compare the ratings matrix obtained from the dot product of the randomly initialized user embedding matrix and movie embedding matrix, which is called the prediction, to the actual ratings matrix shown earlier. From the comparison we get a value that tells us how close or how far the predictions are from the actual values. The machine learning jargon for this, for those who want to be cool, is *error/loss calculation*. The loss function usually used for this type of problem is the squared (L2) loss. L1 loss can also be used; one thing to consider is that, unlike L1, the squared loss penalizes large errors much more heavily. The diagram below shows the predictions made (left), the actual values (right), and how a comparison is made between them (blue arrow) to obtain the loss. We can see that for user 1 with user embedding [0.2, 0.5] and movie 1 with movie embedding [1.2, 2.4], the dot product gives 1.44. This result is then compared to the actual value, which is 3, and the error is calculated.
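Using the numbers from the diagram, the squared loss for that single user-movie pair works out as:

```python
import numpy as np

user = np.array([0.2, 0.5])     # user 1's embedding (from the diagram)
movie = np.array([1.2, 2.4])    # movie 1's embedding

prediction = user @ movie       # 0.2*1.2 + 0.5*2.4 = 1.44
actual = 3.0

loss = (prediction - actual) ** 2   # squared (L2) loss
print(prediction, loss)             # 1.44 2.4336
```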
The derivative of this loss with respect to each embedding that produced it tells us the direction to push each value in the embedding matrices so that they make better predictions. This cycle of making predictions, calculating the loss, obtaining the gradients of the loss and then updating the weights is repeated many times in order to reach values for the embedding matrices that give a low loss. This iterative process is called *gradient descent*.
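The whole loop can be sketched in a few lines of NumPy. The ratings below are made up, 0 marks a missing rating, and the gradients are just the derivative of the squared loss taken over the observed entries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 5 x 4 ratings matrix; 0 means "not rated".
R = np.array([
    [3., 0., 5., 1.],
    [4., 2., 0., 1.],
    [1., 5., 4., 0.],
    [0., 4., 4., 5.],
    [2., 0., 1., 3.],
])
observed = R > 0                     # only train on ratings that exist

U = rng.normal(scale=0.1, size=(5, 2))   # user embeddings (random start)
M = rng.normal(scale=0.1, size=(2, 4))   # movie embeddings (random start)

lr = 0.01
for _ in range(5000):
    error = (U @ M - R) * observed   # prediction error on known ratings
    grad_U = error @ M.T             # gradient of squared loss w.r.t. U
    grad_M = U.T @ error             # gradient of squared loss w.r.t. M
    U -= lr * grad_U
    M -= lr * grad_M

mse = ((U @ M - R) ** 2)[observed].mean()
print(mse)   # small: the embeddings now reproduce the known ratings
```

This is plain full-batch gradient descent for simplicity; the actual prize-winning systems layered many refinements (regularization, biases, stochastic updates) on top of this core idea.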
On a super high level, we’ve done a bit of cooking, so it’s time to make predictions with the learnt embedding matrices we have for the user factors and movie factors. The goal is to take these embeddings, learnt from a handful of users and movies, and generalize them to other users and movies…kinda. We’ve learnt embeddings for the users with the most ratings and the movies with the most ratings. The big question now is: how do we recommend a movie to any one of these users using the embeddings we have learnt?
Assume we have the user-movie rating matrix below, which shows that there are some movies a user hasn’t seen, so they have no rating (white spaces in the image below). We can take a simple dot product of the embeddings learnt by gradient descent in the previous section to predict the ratings for the movies the user has not seen.
That is, we’re asking the computer to predict the rating a user would give a movie based on the knowledge (latent features) it has about the user and about each unrated movie. This can be visualized below. If a user’s embedding leans toward action, the computer will predict high ratings for action movies and lower ratings for movies that aren’t action.
After these rating predictions are made for each movie the user has not seen, they are sorted by rating value and the movie with the highest predicted rating gets recommended to the user. And there we have it, the chicken soup!
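Putting the last two steps together, a sketch (again with made-up, already-learnt embeddings):

```python
import numpy as np

# Hypothetical learnt embeddings ([comedy, action]) for one user
# and four movies.
user = np.array([0.3, 1.8])      # leans heavily toward action
movies = np.array([
    [2.5, 0.2],   # movie 0: comedy
    [0.4, 2.6],   # movie 1: action
    [1.0, 1.0],   # movie 2: a bit of both
    [0.1, 2.2],   # movie 3: action
])
seen = {0, 2}                    # movies the user has already rated

predicted = movies @ user        # predicted rating for every movie
unseen = [i for i in range(len(movies)) if i not in seen]

# Sort the unseen movies by predicted rating, highest first.
ranked = sorted(unseen, key=lambda i: predicted[i], reverse=True)
print(ranked[0])                 # movie 1 (the action movie) wins
```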
**Recap**: We have learnt that, in its most basic form, a recommender system is just a matrix multiply of embeddings, trained using gradient descent, to predict the ratings a user would give movies he/she has not seen yet. Put this way, it sounds so easy. If only we had known this a few years back, we’d have won the Netflix Prize (stops again to reflect on life).
There are a whole bunch of things that can go wrong. This is an incredibly naive system that does not account for things such as a new user, about whom we have no embedding/knowledge, registering on the platform. Since we don’t know the user’s previous tastes, we can’t know what to predict. This problem is infamously called the cold-start problem, and finding a fix for it is beyond the scope of this article so…go do it yourself!
IMPORTANT SIDEBAR: Recently, deep learning has become a buzzword and, because of its recent human-level achievements, most people think what happens in a deep learning system is nothing but magical, so they tend to pose impossibly hard problems and expect the computer to magically come up with answers/predictions. Imagine asking a model to predict whether a child will become a criminal, or to predict the stock market accurately without any mistakes…smh (insert meme of choice here).
In recommender systems, and in nearly every other machine learning problem, the computer is only making predictions from the data it is given. No act of magic happens. What happens instead is just a huge matrix multiplication driven by certain well-understood algorithms (like gradient descent), not some Hollywood, Westworld-like simulation. This means predictions can be wrong and biased more often than not. In cases where the training data is incoherent or your learning algorithm is flawed, expect the trained network to be useless (it can’t learn anything by itself without proper training on properly labeled or unlabeled data). **A neural net is nothing like a human brain in terms of performance, so it clearly cannot act like one.**