Machine Learning vs Deep Learning

Artificial Intelligence is the study and practice of enabling machines to solve problems the way a human would (i.e. to solve problems intelligently). The broad field of AI is a superset that includes the field of machine learning.

In contrast to traditional programming, where the programmer provides explicit steps to achieve a task, machine learning enables a machine to perform a task by learning from examples (i.e. through analysis of input data and its relationship with the desired output). Using cake baking as an analogy: we could either give a machine a recipe for baking a sponge cake (traditional programming), or give it the ingredients and an already baked cake and let it learn through trial and error how best to combine the ingredients to get the desired cake (machine learning).

How do machines learn from examples? 

Almost every phenomenon can be modelled, analyzed or explained using mathematics, so the magic tool behind machine learning is none other than the mathematical function. Suppose we are given cake ingredients as input and, on the other hand, a nicely baked cake as the desired output. What our mathematical function does is this: it accepts the ingredients as input, transforms them following certain mathematical rules, and returns the desired result to us, a cake, as output.

There are numerous mathematical functions, algorithms and models applied in machine learning, some of which include Random Forests, Linear Regression, Logistic Regression, Support Vector Machines and Neural Networks.

The use of large neural networks for machine learning is referred to as deep learning. This means that deep learning is a subset of machine learning, contrary to the common misconception that machine learning and deep learning are two separate subsets of artificial intelligence.

Why is there so much fuss around Deep Learning?

Deep learning has become so popular over the years that AI is now commonly described as consisting of deep learning plus other machine learning algorithms.

In order to understand the reason for deep learning’s popularity, let us compare a common machine learning algorithm like logistic regression with a deep learning approach.

The problem at hand is email spam classification: we want our machine to be able to correctly distinguish relevant emails from spam.

Logistic Regression Approach

We have a collection of emails, both relevant and spam, and that is our data. Our desired result is a label that says whether a particular email is spam or not.

For most machine learning models, such as logistic regression, after gathering our data and their corresponding labels we also need something called features, which must be chosen carefully through a tedious process called feature selection.

Feature selection involves detailed analysis and manipulation of all available data attributes so as to emerge with the most useful features for an excellent model.

Features are attributes that can be extracted from data. For example, if we are gathering data for house price prediction, possible features include the size of the house, its age, the number of rooms, its location, etc. If instead we have pictures of women from different nationalities, possible features could include face shape, hair color, skin color, etc.

It is important to note that the features chosen must be relevant to the problem to be solved. It is not wise to choose the color of paint for the house price prediction problem, or the presence of eyes as a feature from the women’s pictures. If irrelevant features are chosen, the accuracy of your model will be greatly compromised. Basically, an ideal feature is a characteristic of the data that a human expert would consider while trying to solve the problem at hand.

In our case, we have emails. Possible features that a human would look for when deciding whether an email is spam include the presence of certain words like ‘deal’, ‘free’, ‘offer’ and ‘buy’, the presence or absence of an email subject, the sender’s email address, etc.

Note that some features are numeric, such as age, length and size, while others are categorical, like email addresses, words, or categories such as female vs male. Numerical features can be passed directly into mathematical functions, while categorical data require a process called encoding, in which specific values are assigned to represent each category. For example, a feature like presence of an email subject could have the number ‘1’ represent subject present, and ‘0’ represent subject absent.
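As a concrete sketch, here is how a handful of email features, one numeric count and two encoded categorical attributes, might be turned into a feature vector. The attribute names and the list of "spammy" words are made up for illustration, not taken from any real dataset:

```python
def encode_email(email):
    """Turn a few hand-picked email attributes into a numeric feature vector."""
    spam_words = {"deal", "free", "offer", "buy"}
    words = email["body"].lower().split()
    return [
        sum(w in spam_words for w in words),           # numeric: count of spammy words
        1 if email["subject"] else 0,                  # categorical, encoded: subject present?
        1 if email["sender"].endswith(".xyz") else 0,  # categorical, encoded: odd sender domain?
    ]

features = encode_email({
    "subject": "",
    "body": "Buy now and get a free deal",
    "sender": "promo@cheap-stuff.xyz",
})
print(features)  # [3, 0, 1]
```

Every entry is now a number, so the vector can be fed straight into a mathematical function.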

For logistic regression, we pass these features as input to a linear function such as y = mx + b (or even a polynomial function), where x represents the input. The result is then squashed through a sigmoid function into a value between 0 and 1, and that value is compared against a threshold: for values above the threshold, say 0.5, predict that the email is spam, while for values below the threshold predict that the email is not spam.
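A minimal sketch of that pipeline, with toy hand-picked weights (a real model learns these from data):

```python
import math

def predict_spam(features, weights, bias, threshold=0.5):
    """Logistic regression sketch: linear function -> sigmoid -> threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias  # the mx + b part
    p = 1 / (1 + math.exp(-z))  # sigmoid squashes z into (0, 1)
    return "spam" if p > threshold else "not spam"

# Illustrative weights: spammy words and a strange domain push towards spam,
# having a subject pushes away from it.
weights, bias = [1.5, -1.0, 2.0], -2.0
print(predict_spam([3, 0, 1], weights, bias))  # spammy words, no subject -> "spam"
print(predict_spam([0, 1, 0], weights, bias))  # clean email with a subject -> "not spam"
```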

Deep Learning Approach

In deep learning, we do not need feature selection. We usually pass in the data in its raw form and allow the neural network to extract the important attributes it needs by itself. So the first layer of neurons in our neural network will receive the emails as they are, or rather direct numerical representations of the contents of each email (since all non-numerical data must be converted to a numerical representation before the model can work with it).

Several approaches for understanding and visualizing convolutional networks have been developed in the literature, and they reveal that each layer in a deep convolutional neural network tends to be dedicated to detecting the presence of a particular detail or feature in the input data.

This means there is no need for the expertise usually required for good feature selection when training a model like logistic regression. In a nutshell, deep learning does not require structured data (data with hand-picked features), unlike other machine learning models.

And that is awesome news, because I do not need to be a medical doctor to be able to train a deep learning model to detect cancer from images of patients; all I need is data! More importantly, we save the time and effort used up in performing feature selection, extraction and engineering. Please note that feature selection and feature engineering for most real-world problems can be very taxing, and the success of your models greatly depends on the features you choose.

The second reason is that deep learning has achieved state-of-the-art results on many of our most challenging real-world problems. Problems like image classification, object detection, image segmentation, visual relationship identification, natural language processing, speech-to-text processing, etc. have been solved to an astonishing degree using deep neural networks. We can also use deep learning for tabular classification and regression problems like the famous Titanic survival prediction and housing price problems.

# Recommender Systems from 10,000ft

  1. Intro to recommender systems.
  2. User-Movie Matrix
    • How the user-movie matrix is built
    • Matrix Factorization
  3. Embeddings
    • Initializing embeddings
  4. Making things better
  5. Updating the Embeddings
  6. Eating your Chicken Soup (Making Inference)
    • Recommending movies to users
  7. End Notes

# Intro to Recommender Systems (Recsys)

A recommender system predicts products to suggest to a user, products the user might never have found otherwise, by finding similarities between users and products. Ever wondered how products like YouTube, Spotify, Amazon and Netflix keep finding things to show you that usually match your interests? The answer lies in recsys! We’ll be looking at a high-level explanation of how Netflix recommends movies to its users.

In 2006, Netflix organized a competition with a $1 million cash prize for whoever could build a recommender system better than the one they had at the time for predicting which movies to recommend to their users. The basic ideas behind the winning solution are what will be discussed in this post.

# User-Movie Matrix

A user-movie matrix in a Netflix movie recommender setting is a gigantic matrix that contains all the ratings for all the movies by all the users in the database. A sample user-movie matrix for 5 users and 4 movies, with each user’s rating for each movie, is shown below:

## How the user-movie matrix is built

The matrix above was built by querying the database and selecting the users who rated movies the most and the movies that were rated the most, just to simplify things for clarity. In reality, most users don’t rate the movies! Netflix actually maintains a really, really huge matrix with a rating slot for every user and every movie. To put this in context, Netflix has hundreds of millions of users and a vast catalogue of movies, so imagine how big that matrix is. One thing they do is store the matrix in an optimized form that reduces its size. The idea behind this optimization is introduced in the next section.

## Matrix Factorization

Now that we have a matrix of the rating every user gives each movie, what do we do with it? One thing we can do to make more sense of this matrix is to factorize it. If you remember elementary maths, factorization just means breaking a big number like 64 into factors whose product gives it back: 2 and 32, or 8 and 8, etc.

For matrices, factorization involves breaking the matrix into n matrices such that when the n matrices are multiplied together (by matrix multiplication), we obtain the original matrix. For recsys, n is 2, so we have 2 matrix factors for the original user-movie matrix. One of the factors is the user-factor matrix and the other is called the movie-factor matrix. How these user and movie factors are obtained will be discussed in the next section. The general idea is that the 2 factors contain the rating information that depends on the users and on the movies respectively, such that when they are combined, the rating is recovered.
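As a tiny numerical illustration (with made-up factor values), multiplying a user-factor matrix by a movie-factor matrix reconstructs a ratings matrix:

```python
import numpy as np

# Hypothetical factors: 3 users x 2 factors, and 2 factors x 4 movies.
U = np.array([[1.0, 0.0],    # user 0: all about factor 1
              [0.0, 1.0],    # user 1: all about factor 2
              [0.5, 0.5]])   # user 2: a bit of both
M = np.array([[5.0, 1.0, 4.0, 0.0],   # how much each movie has of factor 1
              [0.0, 4.0, 1.0, 5.0]])  # how much each movie has of factor 2

R = U @ M  # multiplying the 2 factors back together gives the 3x4 ratings matrix
print(R[0])  # user 0's ratings: [5. 1. 4. 0.]
```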

# Embeddings

Embeddings are the learnable parameters/numbers which make up the user-factor matrix and the movie-factor matrix. 

NB: The height of the user-factor matrix must equal the number of users in the user-movie rating matrix; ditto the width of the movie-factor matrix for the movies.

A sample factorized user-movie rating matrix with its embeddings is shown below. We are only considering two factors for each of the user- and movie-factor matrices: the amount of comedy-ness or action-ness of each movie for the movie-factor matrix, and how receptive each user is to comedy and action for the user-factor matrix. All of this can be seen in the linked image:

As we can see from the image, each user and movie has its own 2-d embedding, such that when a dot product is taken between them, we obtain the rating for that user-movie combination.

Embeddings are important because they enable us to represent each user’s features (i.e. receptiveness to comedy and action) as numbers the computer can easily make sense of, and likewise for movies. These embeddings are what enable the computer to learn similarities and relationships between movies and users that we may not even know exist. The features captured by these embeddings are called *latent factors/features*. The idea of embeddings is revolutionary, and it is one of the most powerful ideas in training neural networks.

## Initializing Embeddings

Time to address the elephant in the room that has been ignored so far. How do we get numbers that properly represent the features of the users and movies, such that if we take their dot-product we get back the original user-movie rating matrix, or something close to it? Believe it or not, the answer is to initialize all the numbers in the 2 embedding (factor) matrices with random values, and then keep adjusting them, by a means we will discuss shortly, until we get values that reproduce the desired user-movie ratings. This simple idea was one of the many key things that earned the winners of the Netflix prize $1M (now pause and ponder about your life and the choices you’ve made).
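In code, the random start is one line per factor matrix; the sizes below are toy choices matching the 5-user, 4-movie sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 5, 4, 2

# Start both factor matrices with small random numbers; training adjusts them.
user_emb = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_emb = rng.normal(scale=0.1, size=(n_factors, n_movies))

pred = user_emb @ movie_emb  # a (terrible) first guess at the rating matrix
print(pred.shape)  # (5, 4)
```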

NB: When working with embeddings, the computer does not know whether the features it is trying to learn have to do with comedy or action. Its job is just to find the optimal values that give back the rating matrix. However, upon close examination of the learned user- and movie-factor matrices, we will find that users with similar taste in movies end up close to each other in the embedding space. The same also applies to the movies.

# Making things better

In order to improve the randomly initialized embeddings, we compare the ratings matrix obtained from the dot-product of the randomly initialized user embedding matrix and movie embedding matrix, called the prediction, to the actual ratings matrix shown earlier. The comparison gives us a value that tells us how close or how far the predictions are from the actual values. The machine learning jargon for this, for those who want to be cool, is *error/loss calculation*. The loss function usually used for this type of problem is the squared L2 loss. L1 loss can also be used; one thing to keep in mind is that the squared loss penalizes large errors rather heavily. The diagram below shows the predictions (left), the actual values (right), and how a comparison is made between them (blue arrow) to obtain the loss. We can see that for user 1 with user embedding [0.2, 0.5] and movie 1 with movie embedding [1.2, 2.4], the dot product gives 1.44. This result is then compared to the actual value, which is 3, and the error is calculated.
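Here is that comparison in numpy, reusing the [0.2, 0.5] · [1.2, 2.4] = 1.44 example from the diagram; the other ratings are toy numbers, and a mask keeps unrated entries out of the loss (since in reality most users don’t rate):

```python
import numpy as np

actual = np.array([[3.0, 0.0],
                   [4.0, 5.0]])                 # toy ratings matrix
known = np.array([[True, False],
                  [True, True]])                # which entries were actually rated
user_emb = np.array([[0.2, 0.5],                # user 1's embedding, as in the text
                     [1.0, 1.0]])
movie_emb = np.array([[1.2, 2.0],
                      [2.4, 1.0]])              # column 0 is movie 1: [1.2, 2.4]

pred = user_emb @ movie_emb                     # pred[0, 0] = 0.2*1.2 + 0.5*2.4 = 1.44
loss = np.mean((pred[known] - actual[known]) ** 2)  # squared L2 loss over rated entries
print(round(float(loss), 4))
```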

## Updating the Embeddings

The derivative of this loss with respect to each embedding tells us the direction in which to push each value in the embedding matrices so as to make better predictions. This cycle of making predictions, calculating the loss, obtaining the gradients of the loss and then updating the weights is repeated many times in order to reach embedding values that give a low loss. This iterative process is called *gradient descent*.
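The whole cycle fits in a short numpy loop; the ratings, learning rate and step count below are toy choices, and for simplicity every rating is assumed known:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0],
              [4.0, 1.0],
              [1.0, 5.0]])                     # toy ratings, all assumed known
U = rng.normal(scale=0.5, size=(3, 2))         # random user embeddings
M = rng.normal(scale=0.5, size=(2, 2))         # random movie embeddings
lr = 0.1                                       # learning rate

for step in range(2000):
    err = U @ M - R                            # 1. predict and compare
    loss = np.mean(err ** 2)                   # 2. squared L2 loss
    grad_U = 2 * err @ M.T / err.size          # 3. gradients of the loss
    grad_M = 2 * U.T @ err / err.size          #    w.r.t. each factor matrix
    U -= lr * grad_U                           # 4. push the embeddings downhill
    M -= lr * grad_M

print(round(float(loss), 6))  # close to 0: the factors now reproduce R
```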

# Eating your Chicken Soup (Making Inference)

On a super high level, we’ve done a bit of cooking, so it’s time to make predictions with the learnt user-factor and movie-factor embedding matrices. The goal is to use these embeddings, learnt from a set of users and movies, and generalize them to other users and movies…kinda. We’ve learnt embeddings for the users with the most ratings and the movies with the most ratings. The big question now is: how do we recommend a movie to any one of these users using the embeddings we have learnt?

## Recommending Movies to Users

Assume we have the user-movie rating matrix below, in which the movies the user hasn’t seen have no rating (the white spaces in the image below). We can take a simple dot product of the embeddings learnt by the gradient descent done in the previous section to predict the ratings for the movies the user has not seen.

That is, we’re asking the computer to predict the rating a user would give a movie based on the knowledge (latent features) it has about the user and each unrated movie. This can be visualized below. If the user’s embedding leans towards action, the computer will predict high ratings for action movies and lower ratings for movies that aren’t action.

After these rating predictions are made for each of the movies the user has not seen, they are sorted by rating value and the movie with the highest predicted rating gets recommended to the user. And there we have it, the chicken soup!
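The final step is a dot product, a filter and a sort; the "learnt" embedding values below are made up for illustration:

```python
import numpy as np

user_emb = np.array([0.9, 0.1])            # learnt embedding: user leans towards factor 1
movie_embs = np.array([[4.8, 0.5],         # movie 0: lots of factor 1
                       [0.2, 4.9],         # movie 1: lots of factor 2
                       [3.0, 3.0]])        # movie 2: a mix of both
already_rated = {0}                        # the user has already seen movie 0

scores = movie_embs @ user_emb             # predicted rating for every movie
unseen = [i for i in range(len(scores)) if i not in already_rated]
best = max(unseen, key=lambda i: scores[i])
print(best)  # movie 2: highest predicted rating among the unseen movies
```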

# End Notes

**Recap**: We have learnt that, in its most basic form, a recommender system is just a matrix multiply of embeddings trained using gradient descent to predict the ratings a user would give movies he/she has not seen yet. Put this way, it sounds so easy. If only we had known this a few years back, we’d have won the Netflix prize (stops again to reflect on life).

There are a whole bunch of things that can go wrong. This is an incredibly naive system that does not account for things such as a new user, about whom we have no embedding/knowledge, registering on the platform. Since we don’t know the user’s previous tastes, we can’t know what to predict. This problem is infamously called the cold-start problem, and fixing it is beyond the scope of this article so…go do it yourself!

IMPORTANT SIDEBAR: Recently, deep learning has been a buzzword, and because of its recent human-level achievements, most people think what happens in a deep learning system is nothing but magic, so they dream up fairly trivial problems and expect the computer to magically come up with answers/predictions. Imagine asking a model to predict whether a child will become a criminal, or to predict the stock market accurately without any mistakes…smh (insert meme of choice here).

In recommender systems, and in nearly every other machine learning problem, the computer is only making predictions from the data it is given. No act of magic happens. What happens instead is just a huge matrix multiplication driven by certain algorithms (like gradient descent), not some Hollywood Westworld-like simulation. This means that predictions may be wrong and biased more often than not. If the training data is incoherent or your learning algorithm is flawed, expect the trained network to be useless (it can’t learn anything by itself without proper training on properly labeled or unlabelled data). **A neural net is nothing like a human brain in terms of performance, so it clearly cannot act like one.**