
An Analysis of two approaches taken towards Recommendation Systems

Updated: Aug 9, 2020

By - Yomna Mohamed


Recommender systems are an integral part of our daily lives: any social media platform we use runs one to learn more about us. Our Netflix recommendations, Instagram Explore page, YouTube recommendations, TikTok feed and the suggestions on almost all e-commerce websites are all the output of a recommender system algorithm.

But the question is, "How do they get to know us so well?".


 

Approach 1 - KNN



The first-generation approach for building recommender algorithms is called K-Nearest Neighbors (KNN).

Essentially, KNN is a classification method used in Machine Learning to assign a new data point to a group based on the labeled points it is closest to.


Imagine you have divided a grade of 120 students into 3 groups: the athletic ones, the study-focused ones, and the ones who are both athletic and study-focused. Now we have a new student who needs to be classified into one of these 3 groups.

First, you need to think about how you divided the original 120 students, most likely based on things like how many hours they exercise and study per week, what rank they hold in the grade, how well they perform academically, and how many academic and sports events they participate in. These things are our features, and for the sake of simplicity we will consider only two of them: how many hours they exercise and how many hours they study per week.


For better visualization, we will be referring to the image above. It shows a sample of some of the students in the grade, the red dots being the study-focused students and the navy-blue dots being the athletic ones, while the new student is denoted by the purple X. The points are plotted based on the hours spent studying and exercising per week.

The approach we'd usually take is to assign X to the group of the dot closest to it, in this case the red one. However, this particular red dot seems to be an outlier from its actual group, so considering it alone would be an error; measuring X's distance from multiple surrounding dots makes much more sense. We would then assign X to the group with the most points close to it.

In the image above, if we measure X's distance from each of the three points inside the green circle, we would end up assigning X, the new student, to the athletic group. Here we used the 3-Nearest Neighbors "algorithm" to assign student X to their group.
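The 3-Nearest Neighbors step described above can be sketched in a few lines of Python. The student coordinates below are made-up values for illustration, chosen so the purple X lands near the athletic cluster:

```python
import math

# Each student: (hours exercising, hours studying, group) -- made-up sample data
students = [
    (10, 2, "athletic"), (9, 3, "athletic"), (8, 2, "athletic"),
    (1, 12, "study"), (2, 11, "study"), (2, 14, "study"),
]
new_student = (7, 4)  # the purple X

def knn_classify(point, data, k=3):
    # Sort labeled points by Euclidean distance to the new point
    by_distance = sorted(data, key=lambda s: math.dist(point, s[:2]))
    # Majority vote among the k nearest neighbors
    votes = [group for _, _, group in by_distance[:k]]
    return max(set(votes), key=votes.count)

print(knn_classify(new_student, students))  # → athletic
```

With k=3, the three closest dots to X are all athletic students, so X joins the athletic group, just as in the picture.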


Now, in recommender systems, we would have to choose numerous features in order for the output to be what suits the consumers the most. The features would include things like the length of the movie, the actors, the rating, whether it has a happy or sad ending, and its genre.


We'll consider Netflix movies as an example. Now imagine you have a grid of some Netflix users looking like the one below, and you want to "fill in the blanks," i.e., your algorithm's output is the same grid but with predicted ratings for the movies some of your users have not watched yet.


Let’s take a look at this rating grid. The grid has a couple of users with the ratings they gave to some movies.


The first step would be to calculate the distance between each pair of users based on the data present, in order to generate a plot similar to the one we used above, or a distance matrix. The distance is calculated using either the Manhattan distance formula or the Euclidean distance.


Distance Matrix for n users


In our rating grid above, users 2 and 3 have already watched movie 2 and given it a rating, while user 1 still hasn't. So, based on how close user 1 is to the other 2 users, we would predict the features user 1 likes. For example, user 1 could be very similar to user 2, who likes action movies, meaning gives them high ratings. This would lead us to assume that user 1 likes action movies too.
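Building such a distance matrix can be sketched in Python. The rating grid below is hypothetical, with None marking a movie a user hasn't watched, and the distance is computed only over movies both users have rated:

```python
import math

# Hypothetical rating grid: rows = users, columns = movies, None = not watched
ratings = [
    [5, None, 1],   # user 1
    [4, 5, 2],      # user 2
    [1, 2, 5],      # user 3
]

def user_distance(a, b):
    # Euclidean distance over the movies both users have rated
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

n = len(ratings)
matrix = [[user_distance(ratings[i], ratings[j]) for j in range(n)] for i in range(n)]
for row in matrix:
    print([round(d, 2) for d in row])
```

Running this shows user 1 is much closer to user 2 (distance ≈ 1.41) than to user 3 (≈ 5.66), which is what lets us borrow user 2's taste when predicting user 1's missing rating.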


Note that our grid has only 3 users, which is unrealistic; apps like Netflix have millions of users, so their predicted ratings would be based on the K nearest "neighbor" users. The value of K is chosen by the programmer/data scientist based on what would best fit the amount of data they have without overdoing it. In our example, we compared user 1 to both user 2 and user 3; however, if we had 1 million users, it would not make sense to compare all the users to each other, as that process would both use a lot of space and be time-consuming, making it inefficient. This is why we need an appropriate value of K users to compare each user against.


The matrix and plot just show us how similar the users are.


The next step would be to predict the features of movies that a user likes based on how similar they are to others. For example, if a user has high similarity to those who give high ratings to Tom Cruise movies, we would assume that this user likes the actor Tom Cruise. If another user has high similarity to the people who rate action movies highest, we would assume that this user likes the action genre. The features we are predicting here are the actors and genres a user likes. They can be thought of as our independent variables when it comes to predicting the final rating, our dependent variable.



There is a process called Linear Regression, a technique in statistical data analysis used to determine the extent of a linear relationship between a dependent variable and one or more independent variables. So, based on the independent variables (the features) of each user, we would be able to predict a rating for each user, which on a graph could look like this:
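A minimal regression sketch of this idea follows; the feature scores and ratings are invented for illustration, and real systems would use far more features and data points:

```python
import numpy as np

# Made-up training data: each row holds a movie's feature scores
# [action score, romance score]; y is the rating one user gave that movie
X = np.array([[1.0, 0.0], [0.7, 0.4], [0.2, 0.8], [0.0, 1.0]])
y = np.array([5.0, 4.0, 2.0, 1.5])

# Append an intercept column and solve ordinary least squares:
# this minimizes the sum of squared differences between X1 @ coef and y
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict this user's rating for a new, mostly-action movie
new_movie = np.array([0.9, 0.2, 1.0])
prediction = new_movie @ coef
print(prediction)
```

Because this user rated action-heavy movies highly in the training data, the predicted rating for another action-heavy movie comes out high as well.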

There are two major limitations to using KNN in recommender systems. The first is that we assume we know exactly which features actually determine the ratings and that we can automatically extract those features. The second is that we assume we have enough ratings from a user to be able to do Linear Regression, meaning that if we have a brand-new user, we would not be able to recommend any movies to them.


 

Approach 2 - Matrix Factorisation


The general meaning of factorization is having an item that you can break down into 2 or more items that when multiplied give you the original item. For example:

The term xyz can be broken down into x * y * z

Similarly, 28 can be broken down into 7 * 4

So, Matrix Factorization means breaking down a matrix into a number of matrices that when multiplied give us the original matrix.


A user's rating is determined by the movie's features, like the genre, so it's safe to consider the genre a deterministic feature when it comes to predicting a rating. However, it's very important to know that the features considered are not only genres. For this example, we'll consider that only two features exist: the romance genre and the action genre. Our users could like romance only, action only, or both.




Let’s take Sam as an example; Sam likes romance only. Our first movie got a rating of 2 for romance and 1 for action. So, the algorithm would reasonably predict that Sam would rate the movie a 2 out of 5.

In other words, the matrix of whether Sam likes the genres or not multiplied by the movie’s rating in each genre gives us the final matrix that we want: the rating. A final factorization for the full table would look like this:

However, all the matrices need to be in numerical forms, hence the first matrix would change and look like this:
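This factorization can be checked numerically. The numbers below mirror the Sam example (movie 1 scores 2 for romance and 1 for action), while the other users and movies are made up to complete the grid:

```python
import numpy as np

# Rows = users, columns = (romance, action): does each user like the genre?
user_genre = np.array([
    [1, 0],  # Sam likes romance only
    [0, 1],  # another user likes action only
    [1, 1],  # a third user likes both
])

# Rows = (romance, action), columns = movies: each movie's score per genre
genre_movie = np.array([
    [2, 0, 3],  # romance scores; movie 1 scores 2 for romance
    [1, 4, 1],  # action scores; movie 1 scores 1 for action
])

# Multiplying the two factors reproduces the full rating matrix
ratings = user_genre @ genre_movie
print(ratings)
# Sam's predicted rating for movie 1 is ratings[0, 0] → 2
```

Sam's row picks out only the romance scores, so his predicted rating for movie 1 is 2, exactly as in the worked example above.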




How would the algorithm factorise the matrix?

The computer uses a trial-and-error method: it presents two matrices it predicts might be the right ones and asks for "feedback". Usually, the feedback is an error number calculated by a piece of code, and the algorithm keeps generating pairs of matrices until it reaches the smallest error possible. The error number is calculated by squaring the difference between each actual and predicted rating and then adding the squares up.



Error value:

(2 − 0.66)² + (1 − 1.4)² + (2 − 3.6)² + (3 − 1.56)² = 6.5892
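This error value can be reproduced directly in code:

```python
# Actual ratings versus the ratings predicted by one candidate pair of matrices
actual    = [2, 1, 2, 3]
predicted = [0.66, 1.4, 3.6, 1.56]

# Sum of squared differences between actual and predicted ratings
error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
print(round(error, 4))  # → 6.5892
```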


The algorithm would keep generating matrices until this error value is as small as possible. Note that with machine learning it's very unlikely for the computer to generate matrices exactly the same as the ones above, because the data we provide it with is very likely to be sparse.









An extremely important point is that the weightage each user has for the genres would not always be simple 1s and 0s; some users could like both genres, but one more than the other, so in most situations our first matrix is more likely to look like this:






















The example above was explained with a completely filled grid, which is something we would not have, since the whole point is to predict the ratings. Our grid would look something like this at first:


The computer would generate the matrices based on the data already present in each row and column and then use the matrices with the smallest error value to predict the ratings.
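One common way to implement this trial-and-error search is gradient descent on the squared error over the known ratings only. The sketch below uses a made-up 3×3 grid with NaN for unwatched movies; it is an illustration of the idea, not Netflix's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rating grid; np.nan marks movies a user has not rated yet
R = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 1.0],
    [np.nan, 1.0, 5.0],
])
observed = ~np.isnan(R)

k = 2  # number of implicit feature groups
U = rng.random((R.shape[0], k))   # user-to-group weights
M = rng.random((k, R.shape[1]))   # group-to-movie weights

lr = 0.01
for _ in range(10000):
    pred = U @ M
    # Error is measured only on the ratings we actually have
    err = np.where(observed, pred - R, 0.0)
    # Step both factors against the gradient of the squared error
    U -= lr * err @ M.T
    M -= lr * U.T @ err

# The blanks are filled in by the product of the two factors
print(np.round(U @ M, 1))
```

After the loop, U @ M matches the known ratings closely, and the entries where R was NaN serve as the predicted ratings.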








In the example above, I have been talking about the factorized matrices as having features (the romance and action genres) as their columns and rows; however, in reality, that is not the case. Unlike KNN, we don't supply the algorithm with features; instead, we push the algorithm to extract these features implicitly from the rating matrix. How?

We try to categorise both the users and the movies into groups, where each user has a weightage for each user group and, similarly, each movie holds a weightage for each movie group. These groups can be seen as implicit features.


In the above example, we assumed that the implicit features are the action and romance genres and that they are the same for both the user and movie groups. However, in practice, the algorithm does not name these feature columns and rows; we just feed it two numbers: 1. the number of user groups and 2. the number of movie groups. This is the meaning of matrix factorization: we factorise the rating matrix into two matrices whose product is the rating matrix.


One major advantage of Matrix Factorisation is that it saves space. Imagine we have 100 features (like the two genres we used above), 1000 movies and 5000 users. Our rating grid would consist of 5 million values, while the two factorized matrices together would hold only 600k values (features × users grid + features × movies grid).
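The space comparison is quick arithmetic, using the counts from the paragraph above:

```python
users, movies, features = 5000, 1000, 100

full_grid = users * movies                      # every user-movie rating cell
factorized = features * users + features * movies  # the two factor matrices

print(full_grid)    # → 5000000
print(factorized)   # → 600000
```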


For both approaches, we want the algorithm to recommend the movies a user is predicted to rate highly. So, our next step is categorizing the movies into two groups for each user: liked movies and disliked movies. The final step is to filter the recommendation list, removing the movies the user has already watched.
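This final filtering step might be sketched as follows; the movie names, predicted ratings, and the liked/disliked cutoff are all hypothetical:

```python
# Hypothetical predicted ratings for one user, and the movies already watched
predicted = {"Movie A": 4.6, "Movie B": 2.1, "Movie C": 4.9, "Movie D": 3.8}
watched = {"Movie C"}

LIKE_THRESHOLD = 3.5  # assumed cutoff between liked and disliked movies

# Keep the predicted-liked movies the user hasn't seen, best first
recommendations = sorted(
    (m for m, r in predicted.items() if r >= LIKE_THRESHOLD and m not in watched),
    key=predicted.get,
    reverse=True,
)
print(recommendations)  # → ['Movie A', 'Movie D']
```

Movie C is dropped despite its high predicted rating because the user already watched it, and Movie B falls below the liked cutoff.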




 



Avatar stickers attribution: People vector created by freepik (www.freepik.com)

